Selecting a Random Sample from a View in Postgresql

问题: I have generated a view from a table in PostgreSQL consisting of 50,000 rows. I want to take a random sample from this view based on a number of conditions. I understand th...

问题:

I have generated a view from a table in PostgreSQL consisting of 50,000 rows. I want to take a random sample from this view based on a number of conditions. I understand this can be done in the following way:

select * from viewname
where columnname = 'A'  -- the condition
order by columnname 
limit 5;

However, instead of 'limit 5', I want to take a percentage of the number of rows which meet this condition. So for instance, 'limit 5%' (though this is not correct syntax). I understand a similar thing can be done with the tablesample clause but this does not apply to views.


回答1:

You could use the window function PERCENT_RANK

SELECT *
FROM 
(
  select *, PERCENT_RANK() OVER (PARTITION BY columnname ORDER BY random()) AS pcrnk
  from tablename
  where columnname = 'A'
) q
WHERE pcrnk <= 0.05

And if you don't want to see that pcrnk in the result?

SELECT (t).*
FROM 
(
  select t, PERCENT_RANK() OVER (PARTITION BY columnname ORDER BY random()) AS pcrnk
  from tablename t
  where columnname = 'A'
) q
WHERE pcrnk <= 0.05 

Test on db<>fiddle here

These queries will retrieve 5% of what be retrieved normally based on the criteria columnname = 'A' F.e. if there are 100 'A' and 1000 'B', then they return 5 records.

If you want to return 5% of all the records in the table? Then here's another trick.

select *
from tablename
where columnname = 'A'
order by random()
limit 0.05 * (select count(*) from tablename)

回答2:

In order to randomly select a percentage of your rows, and if you have Postgres 9.5 or higher, have a look at Postgres TABLESAMPLE.

It has two options : BERNOULLI and SYSTEM :

The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. [...] These two methods each return a randomly-chosen sample of the table that will contain approximately the specified percentage of the table's rows.

SYSTEM is faster, but BERNOULLI gives better random distribution because each record has the same probability on being selected.

SELECT * 
FROM tablename TABLESAMPLE SYSTEM(5)
WHERE columnname = 'A'  -- the condition
ORDER BY columnname;

NB : this only works if you are querying a table, and not for views.

  • 发表于 2019-01-17 02:25
  • 阅读 ( 271 )
  • 分类:网络文章

条评论

请先 登录 后评论
不写代码的码农
小编

篇文章

作家榜 »

  1. 小编 文章
返回顶部
部分文章转自于网络,若有侵权请联系我们删除