VertexFrame sorted_k


sorted_k(self, k, column_names_and_ascending, reduce_tree_depth=None)

[ALPHA] Get a sorted subset of the data.

Parameters:

k : int32

Number of sorted records to return.

column_names_and_ascending : list

List of (column name, ascending) tuples giving the columns to sort by. For each column, use True to sort in ascending order or False to sort in descending order.

reduce_tree_depth : int32 (default=None)

Advanced tuning parameter that sets the depth of the reduce tree (the operation uses Spark’s treeReduce() for scalability). When omitted, the depth defaults to 2.

Returns:

A new frame with a subset of sorted rows from the original frame.

Return the first k rows of the frame as ordered by the given columns, each sorted in either ascending or descending order.

Sorting a subset of rows is more efficient than sorting the entire frame when the number of sorted rows is much less than the total number of rows in the frame.
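The efficiency argument can be illustrated outside Spark with a minimal plain-Python sketch. It is not how sorted_k is implemented; it only contrasts a full sort, which orders every value before discarding most of them, with a bounded selection that tracks just k candidates. The data and variable names here are made up for illustration.

```python
import heapq
import random

random.seed(0)
values = [random.randint(0, 1000000) for _ in range(100000)]
k = 5

# Full sort: orders all n values, then keeps only the first k.
top_k_full_sort = sorted(values)[:k]

# Bounded selection: scans the data once while holding only k candidates.
top_k_selection = heapq.nsmallest(k, values)

# Both approaches return the same k smallest values in ascending order.
print(top_k_selection == top_k_full_sort)
```

When k is much smaller than the number of values, the selection does far less ordering work than the full sort, which is the situation this method is intended for.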

Notes

The number of sorted rows should be much smaller than the number of rows in the original frame.

In particular:

  1. The number of sorted rows returned should fit in Spark driver memory. The maximum size of serialized results that can fit in the Spark driver is set by the Spark configuration parameter spark.driver.maxResultSize.
  2. If you encounter a Kryo buffer overflow exception, increase the Spark configuration parameter spark.kryoserializer.buffer.max.mb.
  3. Use Frame.sort() instead if the number of sorted rows is very large (in other words, it cannot fit in Spark driver memory).

Examples

These examples deal with the most recently-released movies in a private collection. Consider the movie collection already stored in the frame below:

>>> big_frame.inspect(10)

  genre:str  year:int32   title:str
/-----------------------------------/
  Drama        1957       12 Angry Men
  Crime        1946       The Big Sleep
  Western      1969       Butch Cassidy and the Sundance Kid
  Drama        1971       A Clockwork Orange
  Drama        2008       The Dark Knight
  Animation    2013       Frozen
  Drama        1972       The Godfather
  Animation    1994       The Lion King
  Animation    2010       Tangled
  Fantasy      1939       The Wonderful Wizard of Oz

This example returns the top 3 rows sorted by a single column, ‘year’, in descending order:

>>> topk_frame = big_frame.sorted_k(3, [ ('year', False) ])
>>> topk_frame.inspect()

  genre:str  year:int32   title:str
/-----------------------------------/
  Animation    2013       Frozen
  Animation    2010       Tangled
  Drama        2008       The Dark Knight

This example returns the top 5 rows sorted by multiple columns, first ‘genre’ ascending and then ‘year’ descending:

>>> topk_frame = big_frame.sorted_k(5, [ ('genre', True), ('year', False) ])
>>> topk_frame.inspect()

  genre:str  year:int32   title:str
/-----------------------------------/
  Animation    2013       Frozen
  Animation    2010       Tangled
  Animation    1994       The Lion King
  Crime        1946       The Big Sleep
  Drama        2008       The Dark Knight
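The ordering in the result above can be reproduced locally with a small plain-Python sketch. The helper sorted_k_local is hypothetical and is not part of the library; it assumes the ten rows shown by inspect(10) are the whole frame, and it relies on Python's stable sort by applying the sort keys from least significant to most significant.

```python
COLUMNS = ("genre", "year", "title")

rows = [
    ("Drama", 1957, "12 Angry Men"),
    ("Crime", 1946, "The Big Sleep"),
    ("Western", 1969, "Butch Cassidy and the Sundance Kid"),
    ("Drama", 1971, "A Clockwork Orange"),
    ("Drama", 2008, "The Dark Knight"),
    ("Animation", 2013, "Frozen"),
    ("Drama", 1972, "The Godfather"),
    ("Animation", 1994, "The Lion King"),
    ("Animation", 2010, "Tangled"),
    ("Fantasy", 1939, "The Wonderful Wizard of Oz"),
]


def sorted_k_local(rows, k, column_names_and_ascending):
    """Hypothetical local stand-in: sort by several columns, return k rows."""
    ordered = list(rows)
    # Python's sort is stable, so sorting by the last key first and the
    # first key last produces the combined multi-column ordering.
    for name, ascending in reversed(column_names_and_ascending):
        index = COLUMNS.index(name)
        ordered.sort(key=lambda row: row[index], reverse=not ascending)
    return ordered[:k]


top5 = sorted_k_local(rows, 5, [("genre", True), ("year", False)])
for row in top5:
    print(row)
```

Each tuple in the list pairs a column name with its own direction, so one column can ascend while another descends within the same call.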

This example returns the top 5 rows sorted by multiple columns: ‘genre’ ascending, then ‘year’ ascending. It also illustrates the optional tuning parameter for reduce-tree depth (which does not affect the final result).

>>> topk_frame = big_frame.sorted_k(5, [ ('genre', True), ('year', True) ], reduce_tree_depth=1)
>>> topk_frame.inspect()

  genre:str  year:int32   title:str
/-----------------------------------/
  Animation    1994       The Lion King
  Animation    2010       Tangled
  Animation    2013       Frozen
  Crime        1946       The Big Sleep
  Drama        1957       12 Angry Men
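The statement that reduce-tree depth does not affect the final result can be illustrated with a simplified sketch. This is not Spark's treeReduce() and not the library's code: the partition layout and the fan_in parameter are assumptions made for illustration, with a larger fan-in standing in for a shallower tree. Each partition computes a local top k, and the partial results are merged in rounds until one list remains.

```python
import heapq


def merge_top_k(a, b, k):
    """Merge two partial top-k lists into one descending top-k list."""
    return heapq.nlargest(k, a + b)


def tree_top_k(partitions, k, fan_in):
    """Top k values (descending) via a multi-round merge of partial results."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Step 1: each partition computes its own top k locally.
    partials = [heapq.nlargest(k, part) for part in partitions]
    # Step 2: merge fan_in partial results at a time, round by round.
    while len(partials) > 1:
        merged = []
        for start in range(0, len(partials), fan_in):
            group = partials[start:start + fan_in]
            combined = group[0]
            for other in group[1:]:
                combined = merge_top_k(combined, other, k)
            merged.append(combined)
        partials = merged
    return partials[0]


# The 'year' values from the example frame, split into four made-up partitions.
year_partitions = [[1957, 1946, 1969], [1971, 2008, 2013], [1972, 1994], [2010, 1939]]

deep = tree_top_k(year_partitions, 3, fan_in=2)  # two merge rounds
flat = tree_top_k(year_partitions, 3, fan_in=4)  # one merge round
print(deep == flat)
```

Because every merge keeps the k largest values it has seen, the shape of the tree changes only how the work is distributed, not which rows survive.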