Optimizing Solr Resources with G1

Introduction

The Java community is evolving at a rapid pace. Java 9 deprecated the well-known Concurrent Mark Sweep (CMS) collector in favor of the newer G1 collector. Engineers can resist anything except giving their application beefier resources, especially when it comes to memory-hungry Solr, which lures us into turning the heap up. This article focuses on tuning G1 GC parameters in Solr and explains how blibli.com's search achieved major performance gains with limited resources by adding just 11 lines of configuration!

G1 Collector Overview

In this section I will cover the basics of how G1 works and the key concepts that are essential when configuring the collector for Solr. I won’t cover the entire working of G1, but [1] and [2] are good articles to refer to. From JDK 9 onwards, the JVM uses G1 as its default collector rather than Parallel GC (a later section discusses why JDK 9 and lower are not recommended if you are planning to move to G1). You can also enable G1 in JDK 7 update 4 and later by passing -XX:+UseG1GC on the JVM command line. The key point that sets G1 apart from other collectors is its grid-style heap layout: the heap is partitioned into small, equally sized cells (regions) of memory, and each cell can be free or occupied by the young or the old generation.

The size of each cell can be configured with the -XX:G1HeapRegionSize=n JVM parameter. By default the region size is calculated as maxHeapSize/2048 and rounded down to a power of 2 between 1MB and 32MB; region sizes <1MB or >32MB are not supported.
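As a rough illustration of the default sizing rule above, the following Java sketch computes the region size for a given maximum heap. The class and method names are made up for this example, and the logic mirrors only the formula stated in this article (maxHeapSize/2048, rounded down to a power of 2, kept between 1MB and 32MB); the real JVM may take other inputs into account.

```java
final class G1RegionSize {
    static final long MB = 1024L * 1024L;
    static final long MIN_REGION = 1 * MB;   // smallest supported region
    static final long MAX_REGION = 32 * MB;  // largest supported region (per this article)
    static final long TARGET_REGIONS = 2048; // divisor from the rule above

    /** Region size G1 would pick by default under the rule described in the text. */
    static long defaultRegionSizeBytes(long maxHeapBytes) {
        long raw = Math.max(maxHeapBytes / TARGET_REGIONS, MIN_REGION);
        long roundedDown = Long.highestOneBit(raw); // round down to a power of two
        return Math.min(roundedDown, MAX_REGION);
    }

    public static void main(String[] args) {
        long heap = 18L * 1024 * MB; // 18g, the heap size used in this article's experiment
        System.out.println(defaultRegionSizeBytes(heap) / MB + " MB"); // prints "8 MB"
    }
}
```

For an 18g heap the raw value is 9MB, which rounds down to 8MB regions.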

Humongous Allocation

An object can be bigger than an individual free cell and may require multiple contiguous free cells to fit in. Such objects are called humongous objects, and they play a crucial role in tuning the G1 collector for Solr. An object qualifies as humongous when its size exceeds 50% of the heap region size, i.e. -XX:G1HeapRegionSize, and it is assigned a contiguous set of regions. The JVM treats such objects a bit differently: they can be reclaimed during any type of collection cycle, provided you are on JDK 8u40 or later (JDK-8048179); otherwise they are reclaimed only in a full garbage collection. [4] is a good article explaining how eager reclamation works for humongous objects.
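The humongous rule can be sketched in a few lines of Java. This is only an illustration of the 50% rule described above: the class, the method names and the sample object sizes are hypothetical, and the 8MB region size is just an example value.

```java
final class HumongousCheck {
    static final long MB = 1024L * 1024L;

    /** An object is humongous when it exceeds half of one region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Number of contiguous regions an object of this size spans (ceiling division). */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    public static void main(String[] args) {
        long region = 8 * MB;                     // example region size
        long[] sizes = {1 * MB, 5 * MB, 20 * MB}; // hypothetical object sizes
        for (long size : sizes) {
            boolean humongous = isHumongous(size, region);
            System.out.printf("%d MB object: humongous=%b, regions spanned=%d%n",
                    size / MB, humongous, regionsNeeded(size, region));
        }
    }
}
```

With 8MB regions, a 5MB object is already humongous even though it fits in a single region, while a 20MB object needs three contiguous regions.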

Heap Region and Generation

Nothing special here; it works just like other collectors. Quoting the official documentation from Oracle:

The vast majority of objects are allocated in a pool dedicated to young objects (the young generation), and most objects die there. When the young generation fills up, it causes a minor collection in which only the young generation is collected; garbage in other generations isn’t reclaimed. The costs of such collections are, to the first order, proportional to the number of live objects being collected; a young generation full of dead objects is collected very quickly. Typically, some fraction of the surviving objects from the young generation are moved to the old generation during each minor collection. Eventually, the old generation fills up and must be collected, resulting in a major collection, in which the entire heap is collected. Major collections usually last much longer than minor collections because a significantly larger number of objects are involved.

CMS vs. G1 Collector

Applications running today with either the CMS or the ParallelOld garbage collector would benefit from switching to G1 if the application has one or more of the following traits.

More than 50% of the Java heap is occupied with live data.
The rate of object allocation or promotion varies significantly.
Undesired long garbage collection or compaction pauses (longer than 0.5 to 1 second).

How we optimized Solr with G1

One thing to make clear: optimizing garbage collection won’t reduce the actual memory requirement. It only saves CPU cycles otherwise spent on garbage collection, which can then be used for serving requests. Combined with optimizing memory utilization, GC tuning lets you squeeze much more performance out of both CPU and memory. Moving on, there is no rule of thumb that determines the configuration theoretically; it “all depends” on the following factors:

Size of each document
Number of documents present on a given node
Request throughput for query and update requests
How frequently the index is committed
Number of rows fetched per request
Whether all fields or only a few fields are fetched

All of these factors are interlinked, and any one of them may alter GC behavior. Let’s look at the setup we used to arrive at our ideal GC configuration, along with a few observations.

Experiment Setup

For hardware we had 3 VMs with the following configuration:

Operating System: CentOS 7.2
Memory: 32g
CPU: Intel x86, 16 cores, 2 threads per core

And the Solr configuration was:

Solr v8.3.1
Documents: 5 million
Heap configuration: 18g (Xms = Xmx)
Single shard
Java 8

Each VM ran a single Solr node holding a single NRT replica, and all three nodes were part of one Solr cluster. From there I followed a simple “survival of the fittest” approach to reach the final result. We started with the default configuration (say node 2, β, of the first generation is the default configuration) and tinkered with (mutated) one parameter at a time on the other nodes, i.e. α and γ. We applied equal load to each node and observed the GC pauses, average response time, 95th percentile response time and request throughput. The best-performing setup was carried over to the second generation, and we repeated the process until we reached the desired final configuration.

Instrumenting the application throughout the process is important. We used New Relic to observe the Solr nodes’ heap region usage and the pauses for minor as well as full GCs, and decided accordingly which parameter to update for the next iteration. If you don’t have a paid New Relic license, Grafana and Prometheus will serve the purpose. Alternatively, turn on GC logging by adding the following parameters to the GC_TUNE variable in Solr’s solr.in.sh file and use an online tool such as https://gceasy.io/ to visualize the logs. GC Easy helped us understand the GC activity in a much more granular way than even New Relic. Note that these are the Java 8 logging flags; from Java 9 onwards GC logging is configured through unified logging (-Xlog:gc*) instead.

-XX:+PrintGCDetails

-XX:+PrintGCDateStamps

-XX:+PrintGCCause

-XX:+PrintTenuringDistribution

-XX:+UseGCLogFileRotation

-XX:NumberOfGCLogFiles=10

-XX:GCLogFileSize=5M
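If you want a quick look at GC activity without any external tooling, the JVM also exposes per-collector counters through the standard java.lang.management API. The sketch below is only an illustration and not part of our setup: the GcStats class is a made-up name, and it reports on the JVM it runs in, so to watch a Solr node you would read the same beans remotely over JMX.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcStats {
    /** One line per collector: name, number of collections, accumulated time in ms. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalTimeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Under G1 the collectors are typically named "G1 Young Generation"
        // and "G1 Old Generation".
        snapshot().forEach(System.out::println);
    }
}
```

Sampling these counters at intervals and taking the difference gives a coarse view of how much time goes into young versus old collections.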

Observation #1: Query Analysis

Say you fetch a large number of documents (assume rows=100000) with just fl=id, and on a separate node you request the same rows with fl=*. The two requests will have very different memory footprints. In Solr every document is an Object, and each of its fields is an Object too, so let’s say each document in Solr has 10 String fields. Roughly, the first request, rows=100000&fl=id, creates 100000*1 objects, while the second, rows=100000&fl=*, creates 100000*10, which is 10x more and results in much more aggressive GCs. One thing to note is that such requests create Objects with a very short life span, so, as explained earlier, if we can fit those objects in the survivor space we can reduce the frequency of full GCs.

Observation #2: Humongous Object

This is the most critical observation of all. By tuning this parameter alone we reduced our garbage collection time by 20x! On analyzing the GC logs we saw that most of the time the JVM was busy allocating humongous objects.

This was really concerning: the majority of objects were humongous, and the JVM was busy cleaning them up. Since most objects in Solr have a very short life span, our problem would be solved if we could somehow keep those objects in the survivor space. One way is to increase the region size so that each such object fits within 50% of a region. So we increased the region size to 16m (-XX:G1HeapRegionSize=16m). With this we reduced old gen usage, started using the survivor space and, as a result, reduced GC pauses significantly.
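To see what the change buys, here is a small Java sketch comparing two region sizes on an 18g heap. The 8MB baseline is an assumption: it is what the default rule (maxHeapSize/2048, rounded down to a power of 2) yields for 18g, not a value we logged. The class and method names are hypothetical.

```java
final class RegionSizeTradeoff {
    static final long MB = 1024L * 1024L;

    /** Objects larger than this are treated as humongous. */
    static long humongousThresholdBytes(long regionBytes) {
        return regionBytes / 2;
    }

    /** How many regions the heap is divided into. */
    static long regionCount(long heapBytes, long regionBytes) {
        return heapBytes / regionBytes;
    }

    public static void main(String[] args) {
        long heap = 18L * 1024 * MB; // 18g heap from the experiment setup
        for (long regionMb : new long[] {8, 16}) { // assumed default vs. tuned value
            long region = regionMb * MB;
            System.out.printf("region=%d MB: humongous above %d MB, %d regions%n",
                    regionMb, humongousThresholdBytes(region) / MB, regionCount(heap, region));
        }
    }
}
```

Doubling the region size moves the humongous threshold from 4MB to 8MB, so objects between those sizes are allocated like ordinary young objects again, at the cost of fewer, coarser regions (1152 instead of 2304).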

Observation #3: JVM Version

G1 is the default GC from Java 9. It is present in Java 8 as well, but there it was at a very early stage and you have to enable it explicitly. We experimented against Java 8 and Java 11 specifically. In our tests Java 11 performed significantly better than Java 8. Most of the performance gains came from improved garbage collection: CPU resources were better utilized for serving requests, i.e. more throughput, and there were fewer pauses, i.e. lower latency. Quantitatively speaking, we saw a 50% response time gain on Java 11 compared with Java 8.

Okay, so we have upgraded the JDK version. Let’s see what new stuff comes with the upgrade. We went through the entire release notes and bug fixes. One particular improvement that caught our eye was JEP 307, which made the G1 full (major) GC parallel and shipped in Java 10. Here is the rule of thumb we followed to configure it:

-XX:ParallelGCThreads=n, where n = 5/8 * (total CPU threads), capped at 8

-XX:ConcGCThreads=n where n = ParallelGCThreads/4
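The rule of thumb above can be written out as a short Java sketch. The class and method names are hypothetical, and rounding down with a floor of one thread is an assumption, since the rule does not say how to round.

```java
final class GcThreadRule {
    /** n = 5/8 * (total CPU threads), capped at 8 and never below 1. */
    static int parallelGcThreads(int cpuThreads) {
        int n = (cpuThreads * 5) / 8; // integer division rounds down (assumption)
        return Math.max(1, Math.min(8, n));
    }

    /** n = ParallelGCThreads / 4, never below 1. */
    static int concGcThreads(int parallelThreads) {
        return Math.max(1, parallelThreads / 4);
    }

    public static void main(String[] args) {
        int cpuThreads = 16 * 2; // 16 cores x 2 threads per core, as in the experiment setup
        int parallel = parallelGcThreads(cpuThreads);
        System.out.println("-XX:ParallelGCThreads=" + parallel);
        System.out.println("-XX:ConcGCThreads=" + concGcThreads(parallel));
    }
}
```

For the 32-thread VMs in our setup, 5/8 of the threads is 20, so the cap applies and the result is 8 parallel and 2 concurrent GC threads.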

Observation #4: Solr Caches and their impact on Old gen

As discussed earlier, most objects in Solr are short-lived, and in an “ideal” scenario they should not move to the old gen space. But that’s not always the case. Solr uses different types of caches, whose entries live much longer. A few things that determine old gen utilization are as follows:

Number of caches enabled and cache size: a larger cache size means more old gen utilization and more aggressive major GCs.
Auto warm count: whenever a new searcher is opened, a higher auto warm count means more older objects get copied over, which means fewer objects get collected, i.e. less aggressive mixed collections.
Commit interval: with each commit a new searcher is opened, the older cache is dropped and the hottest entries are copied over, i.e. auto warmed. So the longer the commit interval, the fewer objects are collected.

Considering the above 3 points and monitoring our cache utilization, we sized the young generation, and with it the space left for the old generation, using -XX:G1NewSizePercent=x and -XX:G1MaxNewSizePercent=y (both are experimental flags and require -XX:+UnlockExperimentalVMOptions). By adjusting our old gen space this way we could sustain irregular surges in traffic efficiently, restricting the flow of extra short-lived objects into the old gen space.
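To get a feel for what these two percentages mean in absolute terms, the Java sketch below converts them into sizes for an 18g heap. The values 5 and 60 are G1's defaults, used here only as placeholders; they are not the x and y we settled on. The class and method names are hypothetical.

```java
final class YoungGenBounds {
    /** Size in MB that the given percentage of the heap corresponds to (rounded down). */
    static long percentOfHeapMb(long heapMb, int percent) {
        return heapMb * percent / 100;
    }

    public static void main(String[] args) {
        long heapMb = 18L * 1024; // 18g heap from the experiment setup
        int newSizePercent = 5;     // placeholder for x (G1 default)
        int maxNewSizePercent = 60; // placeholder for y (G1 default)

        long minYoung = percentOfHeapMb(heapMb, newSizePercent);
        long maxYoung = percentOfHeapMb(heapMb, maxNewSizePercent);
        System.out.println("young gen between " + minYoung + " MB and " + maxYoung + " MB");
        System.out.println("at least " + (heapMb - maxYoung) + " MB left outside the young gen");
    }
}
```

Lowering the maximum leaves more room for long-lived cache entries in the old gen, while raising the minimum gives short-lived request objects more room to die young; which way to move depends on the cache settings discussed above.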

Conclusion

After just customizing our GC parameters we gained a 23% improvement in response time and reduced the CPU time spent on garbage collection by 95%. Here is how our response time and throughput stack up.
