I’ve been tinkering with PostgreSQL of late while working on a project, using Laravel as my framework of choice. It’s been many years since I last looked into PostgreSQL, enough that I might as well be starting fresh.
The project requires storing some unstructured data, which means storing it as JSON in a table column. Now, I could have gone with MySQL or MariaDB; while both of these do have a JSON data type, it’s not great when it comes to querying and performance. And that led me to thinking that maybe I should instead go with PostgreSQL on this one.
I last played around with PostgreSQL more than 10-12 years ago, so I thought I’d tinker around, get a feel for it & test out the things that I’ll be using. That led me to the JSON & JSONB data types – the former stores JSON as text, with white space between tokens & duplicate keys preserved, while the latter stores JSON in a decomposed binary format (if there are duplicate keys, only the last value is kept), which results in better query performance. A more detailed comparison is provided in the documentation:
The major practical difference is one of efficiency. The JSON data type stores an exact copy of the input text, which processing functions must reparse on each execution; while JSONB data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. JSONB also supports indexing, which can be a significant advantage.
Because the JSON type stores an exact copy of the input text, it will preserve semantically-insignificant white space between tokens, as well as the order of keys within JSON objects. Also, if a JSON object within the value contains the same key more than once, all the key/value pairs are kept. (The processing functions consider the last value as the operative one.) By contrast, JSONB does not preserve white space, does not preserve the order of object keys, and does not keep duplicate object keys. If duplicate keys are specified in the input, only the last value is kept.
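A quick way to see that duplicate-key behaviour for yourself; this is just a throwaway check I’d run from Tinker (any SQL client works just as well):
use Illuminate\Support\Facades\DB;

// json keeps the input text verbatim, duplicate key and all:
DB::select( "SELECT '{\"a\": 1, \"a\": 2}'::json AS value" );
// => value: {"a": 1, "a": 2}

// jsonb decomposes the input, so only the last duplicate key survives:
DB::select( "SELECT '{\"a\": 1, \"a\": 2}'::jsonb AS value" );
// => value: {"a": 2}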
Setting up for the tests
Laravel supports PostgreSQL out of the box, but that support is rather basic; it does not go deep into PostgreSQL-specific features. So while support for creating JSONB columns is there, there is no way to specify the algorithm to use when creating a full-text index, like GIN, BRIN, etc. The full-text indexes Laravel creates in PostgreSQL use the GIN algorithm, but apparently there is no way to enable or disable fastupdate (none that I could find).
To add support for PostgreSQL-specific features, I installed the tpetry/laravel-postgresql-enhanced package. To specify the algorithm to use with the full-text index, the Blueprint and Schema classes included in the tpetry/laravel-postgresql-enhanced package need to be used in the migration(s) in place of the default Laravel versions.
To check out JSON querying capabilities, index performance, etc., I set up a fresh install and created a migration for the cars table as follows:
use Illuminate\Database\Migrations\Migration;
use Tpetry\PostgresqlEnhanced\Schema\Blueprint;
use Tpetry\PostgresqlEnhanced\Support\Facades\Schema;

return new class extends Migration {
    /**
     * Run the migrations.
     */
    public function up() : void
    {
        Schema::create( 'cars', function ( Blueprint $table ) {
            $table->id();
            $table->string( 'brand' );
            $table->string( 'type' );
            $table->string( 'model' );
            $table->string( 'number' );
            $table->json( 'details' )->nullable();
            $table->timestamps();

            $table->index( 'brand' );
            $table->index( 'type' );
            $table->index( 'model' );
            $table->index( 'number' );
            $table->fullText( 'details' )->algorithm( 'btree' );
        } );
    }

    /**
     * Reverse the migrations.
     */
    public function down() : void
    {
        Schema::dropIfExists( 'cars' );
    }
};
And then I set up the Car model & cast the details column as JSON.
use Illuminate\Database\Eloquent\Factories\HasFactory;
use Illuminate\Database\Eloquent\Model;

class Car extends Model {
    use HasFactory;

    protected $guarded = [];

    protected function casts() : array
    {
        return [
            'details' => 'json',
        ];
    }
}
After this, I created the model factory for Car and a database seeder to seed the database with 10,000 random records. Here is an example record, which also shows the data that is being stored as JSON in the details column.
[
    'id' => 11,
    'brand' => 'Ferrari',
    'model' => 'Purosangue',
    'number' => 'HDGW17JXOL8UDNPBSD4A',
    'speed' => '201',
    'details' => [
        'attributes' => [
            'speed' => 201,
            'colour' => 'green',
            'transmission' => 'automatic',
        ],
        'ownership-history' => [
            [
                'owner' => 'Justina Sauer',
                'sale-date' => '2022-07-05',
                'purchase-date' => '2021-02-05',
            ],
            [
                'owner' => 'Ahmad Collier V',
                'sale-date' => '2023-11-05',
                'purchase-date' => '2022-07-05',
            ],
            [
                'owner' => 'Margarete Bailey',
                'sale-date' => '2024-12-05',
                'purchase-date' => '2023-11-05',
            ],
            [
                'owner' => 'Prof. Vita McClure',
                'sale-date' => null,
                'purchase-date' => '2024-12-05',
            ],
        ],
    ],
]
All this data is randomly generated for each record by the model factory; the ownership-history array is of random length, so the number of owners varies from car to car.
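I won’t reproduce my exact factory here, but a minimal sketch along these lines would produce comparable records; the field values & Faker helper choices below are illustrative, not what I actually used:
use Illuminate\Database\Eloquent\Factories\Factory;

class CarFactory extends Factory
{
    public function definition() : array
    {
        return [
            'brand'   => fake()->company(),
            'type'    => fake()->randomElement( [ 'sedan', 'suv', 'coupe' ] ),
            'model'   => fake()->word(),
            'number'  => strtoupper( fake()->bothify( str_repeat( '?', 20 ) ) ),
            'details' => [
                'attributes' => [
                    'speed'        => fake()->numberBetween( 100, 300 ),
                    'colour'       => fake()->safeColorName(),
                    'transmission' => fake()->randomElement( [ 'manual', 'automatic' ] ),
                ],
                // My actual factory generates a random-length ownership-history;
                // a single fixed entry keeps this sketch short.
                'ownership-history' => [
                    [
                        'owner'         => fake()->name(),
                        'sale-date'     => null,
                        'purchase-date' => fake()->date(),
                    ],
                ],
            ],
        ];
    }
}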
And then this is the code I used across all iterations to measure query performance:
use App\Models\Car;
use Illuminate\Support\Benchmark;

$limit = 50;
$colour = 'green';
$labelA = 'Normal Search';
$labelB = 'Full-text Search';

$searchA = fn () => Car::where( 'details->attributes->colour', $colour )
    ->select( 'id', 'brand', 'model', 'number', 'details->attributes->speed as speed' )
    ->orderBy( 'id', 'asc' )
    ->limit( $limit )
    ->get()->toArray();

$searchB = fn () => Car::whereFullText( 'details->attributes->colour', $colour )
    ->select( 'id', 'brand', 'model', 'number', 'details->attributes->speed as speed' )
    ->orderBy( 'id', 'asc' )
    ->limit( $limit )
    ->get()->toArray();

[$valueA, $durationA] = Benchmark::value( $searchA );
[$valueB, $durationB] = Benchmark::value( $searchB );

printf(
    'Searched for "%1$s" coloured cars in the database.' . PHP_EOL,
    $colour
);
printf( 'Total %d records found.' . PHP_EOL . PHP_EOL, count( $valueA ) );
printf( '"%s" took %f ms to complete.' . PHP_EOL, $labelA, $durationA );
printf( '"%s" took %f ms to complete.' . PHP_EOL, $labelB, $durationB );
printf(
    'Results provided by both "%1$s" & "%2$s" are %3$s.' . PHP_EOL,
    $labelA,
    $labelB,
    ( $valueA === $valueB ) ? 'same' : 'NOT same'
);
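As an aside: before trusting numbers like these, it helps to peek at the query plan & confirm which index (if any) is actually used. This wasn’t part of my benchmark, just a quick check one can run from Tinker, with the WHERE clause written out the way the 'details->attributes->colour' condition compiles on PostgreSQL:
use Illuminate\Support\Facades\DB;

// Dump the planner's output for the colour condition; look for an
// "Index Scan"/"Bitmap Index Scan" line versus a plain "Seq Scan".
$plan = DB::select(
    "EXPLAIN SELECT id FROM cars WHERE details->'attributes'->>'colour' = 'green'"
);
print_r( $plan );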
Running the tests
I ran the test code and this is what I got (over multiple iterations, with negligible difference between them).
Data-type: JSON
Index: Full-text B-Tree
Searched for "green" coloured cars in the database.
Total 50 records found.
"Normal Search" took 101.057646 ms to complete.
"Full-text Search" took 40.883731 ms to complete.
Results provided by both "Normal Search" & "Full-text Search" are same.
I then altered the index to use a GIN index instead of the B-Tree index.
//$table->fullText( 'details' )->algorithm( 'btree' );
$table->fullText( 'details' )->algorithm( 'gin' )->with( [ 'fastupdate' => false ] );
Running the test code again yielded the following result (pretty much the same across multiple iterations):
Data-type: JSON
Index: Full-text GIN
Searched for "green" coloured cars in the database.
Total 50 records found.
"Normal Search" took 95.817358 ms to complete.
"Full-text Search" took 39.838098 ms to complete.
Results provided by both "Normal Search" & "Full-text Search" are same.
It was not a big improvement. Now it was time to try this out with the JSONB data type.
After altering the data-type of the details column to JSONB, I first ran the test code with the full-text index created using B-Tree and then using GIN.
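For reference, the column change is a one-liner in the migration; a minimal sketch, assuming the table is dropped and re-migrated (& re-seeded) rather than altered in place:
// Blueprint's built-in jsonb() method creates a JSONB column on PostgreSQL.
//$table->json( 'details' )->nullable();
$table->jsonb( 'details' )->nullable();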
Data-type: JSONB
Index: Full-text B-Tree
Searched for "green" coloured cars in the database.
Total 50 records found.
"Normal Search" took 121.289042 ms to complete.
"Full-text Search" took 20.185935 ms to complete.
Results provided by both "Normal Search" & "Full-text Search" are same.
Data-type: JSONB
Index: Full-text GIN
Searched for "green" coloured cars in the database.
Total 50 records found.
"Normal Search" took 76.028138 ms to complete.
"Full-text Search" took 18.994181 ms to complete.
Results provided by both "Normal Search" & "Full-text Search" are same.
The results showed a significant improvement over the JSON data-type, with query times cut in half on JSONB. Using a GIN index shows a (marginal) performance gain over a B-Tree index. As per the PostgreSQL documentation & people who know indexing (way more than I do), a GIN index is the way to go for a JSONB column, as it’s more efficient when searching for keys or key/value pairs within JSON data. So one might wonder: is there a catch with GIN?!
An update
I showed this to & discussed it with some folks who are much more knowledgeable than me in the matters of PostgreSQL & Laravel. Tobias Petry, the creator of the tpetry/laravel-postgresql-enhanced package, had a suggestion. Instead of going full-text on the details column, he suggested I do this:
$table->index( 'details jsonb_path_ops' )->algorithm( 'gin' );
And then the @> operator should be used to check for the sub-node and its value in attributes, like this:
$query->where(
    'details',
    '@>',
    json_encode( [ 'attributes' => [ 'colour' => $colour ] ] )
)
which changes the full code for the indexed query to this:
$labelB = 'GIN jsonb-path Search';

$searchB = fn () => Car::where( 'details', '@>', json_encode( [ 'attributes' => [ 'colour' => $colour ] ] ) )
    ->select( 'id', 'brand', 'model', 'number', 'details->attributes->speed as speed' )
    ->orderBy( 'id', 'asc' )
    ->limit( $limit )
    ->get()->toArray();
Running this now yielded an interesting result.
Data-type: JSONB
Index: Actual GIN Implementation
Searched for "green" coloured cars in the database.
Total 50 records found.
"Normal Search" took 76.028138 ms to complete.
"GIN jsonb-path Search" took 6.647964 ms to complete.
Results provided by both "Normal Search" & "GIN jsonb-path Search" are same.
Now that is really interesting. One might ask what the hell just happened!!
It turns out that GIN supports this kind of matching, but by default it’s wired up only for tsvector values and not for jsonb the way I was trying to use it here. To make GIN applicable to the search here, it had to be hooked up to the jsonb path operators, which are @>, @? & @@. Another option is to use jsonb_ops, which supports more operators, but as per the PostgreSQL docs, jsonb_path_ops offers somewhat better performance. I’ll need to check up on that & do comparative tests, but that’s for another time.
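For the curious, here is roughly what that looks like at the SQL level. Treat this as an illustration: the index name is made up, and it’s the shape of the statements that matters.
use Illuminate\Support\Facades\DB;

// Roughly the index the jsonb_path_ops migration line creates:
DB::statement( 'CREATE INDEX cars_details_index ON cars USING gin (details jsonb_path_ops)' );

// And the containment check that the @> query boils down to:
$rows = DB::select(
    'SELECT id, brand, model, number FROM cars WHERE details @> ?',
    [ json_encode( [ 'attributes' => [ 'colour' => 'green' ] ] ) ]
);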
Data indexing & the catch with GIN indexes!
Indexing data is a requirement if results are expected in a reasonable time rather than an eternity. In addition to B-Tree indexes, PostgreSQL supports Hash, GiST, SP-GiST, GIN & BRIN indexes. For our purpose here, GIN is more suited, as per this explanation from the PostgreSQL documentation:
GIN indexes are “inverted indexes” which are appropriate for data values that contain multiple component values, such as arrays. An inverted index contains a separate entry for each component value, and can efficiently handle queries that test for the presence of specific component values.
But GIN indexes are not magic bullets, and they have an Achilles’ heel: they are slow on data inserts and updates. This is why PostgreSQL turns the fastupdate flag ON by default for all GIN indexes. When that flag is enabled, PostgreSQL does not update the GIN index immediately when a record is inserted or updated. Rather, the entry is kept in a buffer, & when that buffer’s limit is reached (4 MB by default, set by gin_pending_list_limit), PostgreSQL triggers an index update, which can take a while depending on the amount of data it has to update and the complexity of that data.
But why is it done this way? Why not just update the index as the data comes in?
Looking at the PostgreSQL documentation for GIN fast update, we can read why that’s the case.
Updating a GIN index tends to be slow because of the intrinsic nature of inverted indexes: inserting or updating one heap row can cause many inserts into the index (one for each key extracted from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. When the table is vacuumed or autoanalyzed, or when the gin_clean_pending_list function is called, or if the pending list becomes larger than gin_pending_list_limit, the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional vacuum overhead. Moreover the overhead work can be done by a background process instead of in foreground query processing.
The main disadvantage of this approach is that searches must scan the list of pending entries in addition to searching the regular index, and so a large list of pending entries will slow searches significantly. Another disadvantage is that, while most updates are fast, an update that causes the pending list to become “too large” will incur an immediate cleanup cycle and thus be much slower than other updates.
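Incidentally, that gin_clean_pending_list function can be called by hand, which gives a crude way to flush the buffer at a moment of your choosing; the index name below is made up:
use Illuminate\Support\Facades\DB;

// Force-flush the GIN pending list into the main index structure.
DB::select( "SELECT gin_clean_pending_list('cars_details_index'::regclass)" );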
So data inserts & updates will be fast, and look-ups will be fast, but if the buffer limit is reached, the insert/update which triggered the clean-up has to wait while PostgreSQL flushes the buffer & updates the GIN index. That can lead to seemingly random data inserts/updates being painfully slow, as Sergios found out to their dismay when their production environment kept hitting a very slow insert/update query at random.
In their blog post, Sergios does an in-depth analysis of this (give it a read, it provides valuable insight) along with benchmarks of different possible solutions. Here is a brief summary:
- If fastupdate is enabled, then data inserts/updates are fast as long as they do not trigger a clean-up of the buffer.
- If fastupdate is enabled and a data insert/update triggers the clean-up, then that operation becomes slow, painfully slow; it can take a few seconds (yes, we cross the millisecond threshold at this point) or a few minutes, depending on the size of gin_pending_list_limit.
- If fastupdate is not enabled, then data inserts/updates are a bit slower. Not painfully slow, but slower.
Do we GIN? Or not?
There are multiple solutions to this problem, as proposed by people who have delved into it many times. Here are some of them:
- Disable fastupdate to get predictable & consistent performance.
- Set gin_pending_list_limit much higher so that clean-ups are not triggered frequently. But this can impact SELECT queries, as PostgreSQL will need to look into both the GIN index and the buffer.
- Set gin_pending_list_limit to something small, like 128 KB or thereabouts, which will trigger clean-ups often but on much smaller buffers, so the triggering insert/update query takes a few seconds instead of a few minutes. (A sketch of these first three tweaks follows this list.)
- Drop the GIN index before doing data inserts/updates and re-create the index afterwards.
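Here is what the first three options look like as raw statements; again, the index name is made up & the values are examples, not recommendations:
use Illuminate\Support\Facades\DB;

// Option 1: disable fastupdate on the index for a predictable write cost.
DB::statement( 'ALTER INDEX cars_details_index SET (fastupdate = off)' );

// Options 2 & 3: raise or lower the pending-list buffer for this index
// (the value is in kilobytes; 128 here matches the "small buffer" option).
DB::statement( 'ALTER INDEX cars_details_index SET (gin_pending_list_limit = 128)' );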
Of the above, the last option is recommended by the PostgreSQL documentation as well, but it’s best suited to bulk inserts/updates only, because while the index is down, any look-ups on that column by concurrent requests will be slow.
The second option is akin to sweeping trash under the rug: there’s only so much that can go under it. Eventually a buffer clean-up will be triggered, and the query that triggers it will wait a long time. Still, this approach can be considered if data inserts/updates are not done directly via user action but asynchronously, by a scheduled task or a queue worker.
The first & third options are more viable, depending on where & how the data inserts/updates happen.
In my case, I’m going with the first option & disabling fastupdate for now. It isn’t a significant issue at present & will not be for some time; there’s a thing about not optimising prematurely. Still, looking into this & reading up on the GIN index & its Achilles’ heel was very interesting. 🙂