Avoid sequential scan in PostgreSQL link table with highly variant density

I had a particularly knotty problem today. I have two tables:

data_file (id, filename, source, stuff) extracted_item (id, item_name, stuff)

A data file comes in and we store mention of it in data_file. The file can have zero, or more commonly, some finite positive number of items in it. We extract some data, and store those extracted items in, you guessed it, extracted_item.

There are tens of sources, and over time, tens of thousands of data files processed. No one source accounted for more than, say, 10% of the extracted items.

Now, sometimes the same extracted item appears in more than one file. We don’t want to store it twice, so what we have is the classic “link table,” “junction table,” or “many-to-many table,” thus:

data_file_extracted_item_link (data_file_id, extracted_item_id)

There are of course indices on both data_file_id and extracted_item_id.

Now, most data files have a tiny few items (1 is the modal number of items per file), but a couple of my data sources send files with almost 1 million items per file. Soon my link table grew to over 100 million entries.

When I went to do a metrics query like this:

select count(distinct data_file.filename), count(data_file_extracted_item_link.*) from data_file left join data_file_extracted_item_link on data_file.id=data_file_extracted_item_link.data_file_id where source=$1 and [SOME OTHER CONDITIONS]

I would sometimes get instant (40 ms) responses, and sometimes get minutes-long responses. It depended upon the conditions and the name of the source, or so it seemed.

ANALYZE told me that sometims the Postgresql planner was choosing a sequential scan (seqscan) of the link table, with its 100 million rows. This was absurd, since 1. there were indices available to scan, and 2. no source ever accounted for more than a few percent of the total link table entries.

It got to the point where it was faster by orders of magnitude for me to write two separate queries instead of using a join. And I do mean “write” — I could manually write out a new query and run it in a different psql terminal minutes before Postgres could finish the 100 million + row seqscan.

When I examined pg_stats, I was shocked to find this gem:
select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link'; tablename | attname | null_frac | avg_width | n_distinct | correlation -------------------------------+-------------------+-----------+-----------+------------+------------- data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647 data_file_extracted_item_link | data_file_id | 0 | 33 | 299 | 0.106799

What was going on? Postgres though there were only 299 different data files represented among the 100 million rows. Therefore, when I went to look at perhaps 100 different data files from a source, the query planner sensibly thought I’d be looking at something like a third of the entire link table, and decided a seqscan was the way to go.

It turns out that this is an artifact of the way the n_distinct is estimated. For more on this, see “serious under-estimation of n_distinct for clustered distributions” http://postgresql.nabble.com/serious-under-estimation-of-n-distinct-for-clustered-distributions-td5738290.html

Make sure you have this problem, and then, if you do, you can fix it by issuing two DDL statments (be sure to put these in your DDL / migrations with adequate annotation, and be aware they are PostgreSQL-specific).

First, choose a good number for n_distinct using guidance from http://www.postgresql.org/docs/current/static/sql-altertable.html

(In a nutshell, if you don’t want to be periodically querying and adjusting this with an actual empirical number, you can choose a negative number from (-1, 0] to force the planner to guess that the sparsity is abs(number), such that -1 => 100% sparsity.)

Then, you can simply

alter table data_file_extracted_item_link alter column data_file_id set (n_distinct = -0.5); analyze data_file_extracted_item_link;

After which, things are better:

select tablename, attname, null_frac, avg_width, n_distinct, correlation from pg_stats where tablename='data_file_extracted_item_link'; tablename | attname | null_frac | avg_width | n_distinct | correlation -------------------------------+-------------------+-----------+-----------+------------+------------- data_file_extracted_item_link | extracted_item_id | 0 | 33 | 838778 | -0.0032647 data_file_extracted_item_link | data_file_id | 0 | 33 | -0.5 | 0.098922

and no more grody seqscan.

Avoid sequential scan in PostgreSQL link table with highly variant density

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112