Get the last row by timestamp in Pig
I have some slowly changing metadata that is written to HDFS in real
time. I would like to write a Pig job that condenses these rows down to
the most recent row (highest ts) for each key.
For example, for these data (column headers added for clarity):
ts meta key
-- ---- ---
1 foo id1
2 que id2
3 que id2
4 foo id1
5 pasa id2
6 pasa id2
7 foo id1
8 pasa id2
9 pasa id2
10 pasa id2
11 pasa id2
12 hombre id2
13 foo id1
14 foo id1
15 hombre id2
16 bar id1
17 bar id1
18 bar id1
19 bar id1
20 bar id1
I would expect to get the output:
15 hombre id2
20 bar id1
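To be precise about what I'm after, here is the same reduction sketched in plain Python (not Pig). The function name `latest_per_key` and the tuple layout `(ts, meta, key)` are made up for illustration, and the sketch assumes timestamps are unique within a key:

```python
# Sample rows from the table above, as (ts, meta, key) tuples.
rows = [
    (1, "foo", "id1"), (2, "que", "id2"), (3, "que", "id2"),
    (4, "foo", "id1"), (5, "pasa", "id2"), (6, "pasa", "id2"),
    (7, "foo", "id1"), (8, "pasa", "id2"), (9, "pasa", "id2"),
    (10, "pasa", "id2"), (11, "pasa", "id2"), (12, "hombre", "id2"),
    (13, "foo", "id1"), (14, "foo", "id1"), (15, "hombre", "id2"),
    (16, "bar", "id1"), (17, "bar", "id1"), (18, "bar", "id1"),
    (19, "bar", "id1"), (20, "bar", "id1"),
]


def latest_per_key(rows):
    """Return, for each key, the single row with the largest ts."""
    latest = {}
    for ts, meta, key in rows:
        # Keep a row only if it is newer than the one already held for this key.
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, meta, key)
    # Sort by timestamp so the output order matches the example above.
    return sorted(latest.values())


for row in latest_per_key(rows):
    print(*row)
```

In other words: group by key, then keep only the row with the maximum ts in each group.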
Is there a built-in way to do this in Pig, or an existing library that
does it, or should I look at writing a UDF?