Remove word from string based on matching item in array

问题: I have this table: Assume that "florio" is a city contained somewhere in the AllLocationTerms array column. How do I remove "florio" when it exists on my list of loc...

问题:

I have this table:

Big Query Table

Assume that "florio" is a city contained somewhere in the AllLocationTerms array column.

How do I remove "florio" when it exists on my list of locations in AllLocationTerms array column?

Basically, I want to remove all matching items in AllLocationTerms from "Query" column.

It can happen that there are 2 or more words - "new york apartments" as Query and "new", "york" in the array. The outcome should be "apartments" in this case.


回答1:

Below is to address use case like yours when you need to check 800,000 rows against the location list array which has around 40k items
So, 40K items are definitely too much for being used to construct regular expression as it is in my previous answer.
So, to address this, I propose to rather split your query string into separate words preserving position number - then exclude those which are terms by left joining and finally assemble the survived words back to string

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms 
)
SELECT *,
  (
    SELECT STRING_AGG(word, ' ' ORDER BY pos) 
    FROM (
      SELECT word, MIN(pos) pos 
      FROM UNNEST(SPLIT(query, ' ')) word WITH OFFSET AS pos
      LEFT JOIN UNNEST(allLocationTerms) term 
      ON word = term
      GROUP BY word
      HAVING COUNT(DISTINCT term) = 0
    )
  ) modified_query
FROM `project.dataset.table`

回答2:

Below is for BigQuery Standard SQL

#standardSQL
SELECT *, 
  REGEXP_REPLACE(query, 
    (SELECT CONCAT('\b', STRING_AGG(term, '\b|\b'), '\b') FROM UNNEST(allLocationTerms) term),
  '') modified_query
FROM `project.dataset.table`   

you can test, play with above using dummy data as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms 
)
SELECT *, 
  REGEXP_REPLACE(query, 
    (SELECT CONCAT('\b', STRING_AGG(term, '\b|\b'), '\b') FROM UNNEST(allLocationTerms) term),
  '') modified_query
FROM `project.dataset.table`    

result is

Row query                              clicks   allLocationTerms    modified_query   
1   florio management apartments            1   battle              management apartments    
                                                creek        
                                                iowa         
                                                florio       
2   florio creek management iowa apartments 1   battle              management apartments    
                                                creek        
                                                iowa         
                                                florio       
  • 发表于 2018-07-08 01:05
  • 阅读 ( 295 )
  • 分类:sof

条评论

请先 登录 后评论
不写代码的码农
小编

篇文章

作家榜 »

  1. 小编 文章
返回顶部
部分文章转自于网络,若有侵权请联系我们删除