EdgeFrame bin_column_equal_depth


bin_column_equal_depth(self, column_name, num_bins=None, bin_column_name=None)

Classify column values into groups that each hold approximately the same number of rows.

Parameters:

column_name : unicode

The column whose values are to be binned.

num_bins : int32 (default=None)

The maximum number of bins. Default is the square-root choice, \lfloor \sqrt{m} \rfloor, where m is the number of rows.

bin_column_name : unicode (default=None)

The name for the new column holding the grouping labels. Default is <column_name>_binned.

Returns:

: dict

A list containing the edges of each bin.

Group rows of data based on the value in a single column and add a label identifying the group each row belongs to.

Equal depth binning attempts to label rows such that each bin contains the same number of elements. For n bins of a column C of length m, the bin number is determined by:

\lceil n \cdot \frac{f(C)}{m} \rceil

where f is a tie-adjusted ranking function over the values of C. If a value occurs more than once in C, each occurrence receives the same tie-adjusted rank: the average of the ordinal ranks those occurrences would otherwise hold.
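
The formula can be checked by hand with a small pure-Python sketch. The helper names tie_adjusted_ranks and equal_depth_bins are hypothetical and are not part of the frame API. The sketch assumes the ceiling formula yields 1-based bin numbers and subtracts one to obtain zero-based labels like those shown in the Examples section; it is an illustration of the formula, not the library's implementation.

```python
import math
from collections import defaultdict


def tie_adjusted_ranks(values):
    """Return a 1-based rank per value; tied values share the average of their ordinal ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = defaultdict(list)
    for ordinal, i in enumerate(order, start=1):
        positions[values[i]].append(ordinal)
    return [sum(positions[v]) / float(len(positions[v])) for v in values]


def equal_depth_bins(values, num_bins):
    """Apply ceil(n * f(C) / m) to each value, then shift to zero-based labels (assumption)."""
    m = len(values)
    ranks = tie_adjusted_ranks(values)
    return [int(math.ceil(num_bins * r / m)) - 1 for r in ranks]


column_a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
labels = equal_depth_bins(column_a, 5)
# labels == [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4]
```

The two leading 1s tie for ordinal ranks 1 and 2, so both receive rank 1.5 and land in the same bin.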

Notes

  1. Unicode in column names is not supported and will likely cause the drop_frames() method (and others) to fail!
  2. The num_bins parameter is considered to be the maximum permissible number of bins because the data may dictate fewer bins. For example, if the column to be binned has X elements with only 2 distinct values and the num_bins parameter is greater than 2, then the actual number of bins will only be 2. This is due to a restriction that elements with an identical value must belong to the same bin.
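
The default bin count and the cap described in note 2 can be sketched as follows. Both function names are hypothetical helpers, not part of the frame API, and max_effective_bins only computes an upper bound on the bin count under the stated restriction; the number of bins the library actually produces may be lower.

```python
import math


def default_num_bins(num_rows):
    """Square-root choice: floor(sqrt(m)) for a column of m rows."""
    return int(math.floor(math.sqrt(num_rows)))


def max_effective_bins(values, num_bins=None):
    """Upper bound on the bin count: identical values must share a bin,
    so there can never be more bins than distinct values."""
    if num_bins is None:
        num_bins = default_num_bins(len(values))
    return min(num_bins, len(set(values)))


two_valued = [7, 7, 7, 9, 9, 9]
capped = max_effective_bins(two_valued, num_bins=5)  # 2 distinct values cap the count at 2
default_for_eleven_rows = default_num_bins(11)       # floor(sqrt(11)) is 3
```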

Examples

Given a frame with column a accessed by a Frame object my_frame:

>>> my_frame.inspect( n=11 )

  a:int32
/---------/
    1
    1
    2
    3
    5
    8
   13
   21
   34
   55
   89

Modify the frame, adding a column showing which bin each row falls in. The data should be grouped into a maximum of five bins. Each bin holds the same number of members as far as possible, and the bin labels in the new column start at 0:

>>> cutoffs = my_frame.bin_column_equal_depth('a', 5, 'aEDBinned')
>>> my_frame.inspect( n=11 )

  a:int32     aEDBinned:int32
/-----------------------------/
      1                   0
      1                   0
      2                   1
      3                   1
      5                   2
      8                   2
     13                   3
     21                   3
     34                   4
     55                   4
     89                   4

>>> print cutoffs
[1.0, 2.0, 5.0, 13.0, 34.0, 89.0]
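
The returned cutoffs can be used to look up the bin for a value. The sketch below uses a hypothetical helper, bin_for_value, and assumes each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both. That convention is inferred from the output above and is not stated by the documentation.

```python
from bisect import bisect_right


def bin_for_value(cutoffs, value):
    """Map a value to a zero-based bin index using the bin edges (assumed convention)."""
    if value < cutoffs[0] or value > cutoffs[-1]:
        raise ValueError("value lies outside the binned range")
    index = bisect_right(cutoffs, value) - 1
    # The final edge belongs to the last bin rather than opening a new one.
    return min(index, len(cutoffs) - 2)


cutoffs = [1.0, 2.0, 5.0, 13.0, 34.0, 89.0]
column_a = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
recovered = [bin_for_value(cutoffs, v) for v in column_a]
# recovered == [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4], matching the aEDBinned column
```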