towhee.dataframe.dataframe.DataFrame

class towhee.dataframe.dataframe.DataFrame(name: str, columns: List[Tuple[str, str]])[source]

Bases: object

A DataFrame is a collection of immutable, potentixally heterogeneous blogs of data.

Parameters:
  • columns (list[Tuple[str, Any]]) – The list of the column names and their corresponding data types.

  • name (str) – Name of the dataframe; DataFrame names should be the same as its representation.

Methods

ack

An acknowledgement (ack) will notice the `DataFrame`s that the rows already iterated over are no longer used, and can be garbage collected. Up to this index but not including.

gc

gc_blocked

Garbage collection for blocked iterator list.

gc_data

Garbage collection function to trigger towhee.Array GC.

get

Get data from dataframe based on offset and how many values requested.

get_window

Window based data retrieval.

notify_map_block

Used by map based iterators to notify the dataframe that it is waiting on a value.

notify_window_block

Used by window based iterators to notify the dataframe that it is waiting on a value.

put

Append a new value to the dataframe.

register_iter

Registering an iter allows the df to keep track of where each iterator is in order to coordinate garbage collection and other processes.

remove_iter

Removing the iterator from the df.

seal

Function to seal the dataframe.

unblock_all

Forceful unblocking/killing of all blocked iter.

unblock_iter

Forceful unblocking/killing of the selected iter.

Attributes

columns

current_size

data

iterators

name

sealed

size

types

__getitem__(key)[source]

Get data at the passed in offset.

__init__(name: str, columns: List[Tuple[str, str]])[source]
ack(offset: int, iter_id: int)[source]

An acknowledgement (ack) will notice the `DataFrame`s that the rows already iterated over are no longer used, and can be garbage collected. Up to this index but not including.

Parameters:
  • offset (int) – The latest accepted offset.

  • iter_id (int) – The iterator id to set offset for.

gc_blocked()[source]

Garbage collection for blocked iterator list.

gc_data()[source]

Garbage collection function to trigger towhee.Array GC.

get(offset: int, count: int = 1, iter_id: int = False)[source]

Get data from dataframe based on offset and how many values requested.

Parameters:
  • offset (int) – The starting index to read data from.

  • count (int) – How many values to retrieve.

  • iter_id (int) – Which iterator is reading the data.

Returns:

Different states of response codes depending on the dataframe state.

Return type:

Responses

get_window(start: int, end: int, step: int, comparator: str, iter_id: int = False)[source]

Window based data retrieval. Windows include everything up to but not including the cutoff.

Parameters:
  • start (int) – The starting index to read data from.

  • end (int) – The current upper window limit.

  • step (int) – How far the window will slide for next call

  • comparator ('timestamp' | 'row_id') – Which value is being used for window calculations.

  • iter_id (int) – Which iterator is reading the data.

Returns:

Different states of response codes depending on the dataframe state.

Return type:

Responses

Situations:

Unsealed:

[0, 1, 2, 3] window = (5, 10) 1. Window start is greater than the latest value in df:

  1. block for value greater than window start

[0, 1, 2, 3] window = (0, 10) 2. Window end is greater than the latest value in df:

  1. block for value that is greater than window end

[5, 6, 7, 8] window = (1, 4) 3. Window end is smaller than all values in df

  1. return None and continue to next window

[0, 1, 3, 4] window = (0, 3) 4. Values exist between window start and window end:

  1. return values, clear data up to first value

Sealed:

[0, 1, 2, 3] window = (5, 10) 1. Window start is greater than the latest value in df:

  1. Stop iteration

[0, 1, 2, 3] window = (0, 10) 2. Window end is greater than the latest value in df:

  1. return remaining window values and stop iteration

[5, 6, 7, 8] window = (1, 4) 3. Window end is smaller than all values in df

  1. return None and continue to next window

[0, 1, 3, 4] window = (0, 3) 4. Values exist between window start and window end:

  1. return values, clear data up to first value

notify_map_block(event: Event, offset: int, count: int, iter_id: int)[source]

Used by map based iterators to notify the dataframe that it is waiting on a value.

Parameters:
  • event (threading.Event) – The event to trigger once the value being waited on is available.

  • offset (int) – The offset that is being waited on.

  • count (`int) – How many values are being waited on.

  • iter_id (int) – The iterator id of the blocked iterator.

notify_window_block(event: Event, edge: str, cutoff: Tuple[str, int], iter_id: int)[source]

Used by window based iterators to notify the dataframe that it is waiting on a value.

Parameters:
  • iter_id (int) – The iterator id of the blocked iterator.

  • event (threading.Event) – The event to trigger once the value being waited on is available.

  • edge ('start', 'end') – Waiting for a value >= than start, or waiting for a value greater then end.

  • cutoff (“timestamp” | “row_id”, int) – The window cutoff that is is being waited on.

put(item) None[source]

Append a new value to the dataframe.

Parameters:

item (dict | list | Array | tuple) – The row to insert.

register_iter()[source]

Registering an iter allows the df to keep track of where each iterator is in order to coordinate garbage collection and other processes.

Returns:

An iterator ID that is used for a majority of iterator-dataframe interaction.

remove_iter(iter_id: int)[source]

Removing the iterator from the df. Allows for more garbage collecting. :param iter_id: The iterator’s id. :type iter_id: int

seal()[source]

Function to seal the dataframe. A sealed dataframe will not accept anymore data and the act of sealing the dataframe will trigger blocked iterators to pull the remaining data.

unblock_all()[source]

Forceful unblocking/killing of all blocked iter.

unblock_iter(iter_id: int)[source]

Forceful unblocking/killing of the selected iter.

Parameters:

iter_id (int) – The id of the iterator being unblocked.