问题:
I have generated a view from a table in PostgreSQL consisting of 50,000 rows. I want to take a random sample from this view based on a number of conditions. I understand th...
可以将文章内容翻译成中文,广告屏蔽插件会导致该功能失效:
问题:
I have generated a view from a table in PostgreSQL consisting of 50,000 rows. I want to take a random sample from this view based on a number of conditions. I understand this can be done in the following way:
select * from viewname
where columnname = 'A' -- the condition
order by columnname
limit 5;
However, instead of 'limit 5', I want to take a percentage of the number of rows which meet this condition. So for instance, 'limit 5%' (though this is not correct syntax). I understand a similar thing can be done with the tablesample clause but this does not apply to views.
回答1:
You could use the window function PERCENT_RANK
SELECT *
FROM
(
select *, PERCENT_RANK() OVER (PARTITION BY columnname ORDER BY random()) AS pcrnk
from tablename
where columnname = 'A'
) q
WHERE pcrnk <= 0.05
And if you don't want to see that pcrnk in the result?
SELECT (t).*
FROM
(
select t, PERCENT_RANK() OVER (PARTITION BY columnname ORDER BY random()) AS pcrnk
from tablename t
where columnname = 'A'
) q
WHERE pcrnk <= 0.05
Test on db<>fiddle here
These queries will retrieve 5% of what be retrieved normally based on the criteria columnname = 'A'
F.e. if there are 100 'A' and 1000 'B', then they return 5 records.
If you want to return 5% of all the records in the table? Then here's another trick.
select *
from tablename
where columnname = 'A'
order by random()
limit 0.05 * (select count(*) from tablename)
回答2:
In order to randomly select a percentage of your rows, and if you have Postgres 9.5 or higher, have a look at Postgres TABLESAMPLE.
It has two options : BERNOULLI
and SYSTEM
:
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. [...] These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table's rows.
SYSTEM
is faster, but BERNOULLI
gives better random distribution because each record has the same probability on being selected.
SELECT *
FROM tablename TABLESAMPLE SYSTEM(5)
WHERE columnname = 'A' -- the condition
ORDER BY columnname;
NB : this only works if you are querying a table, and not for views.