towhee.functional.mixins.data_processing.DataProcessingMixin

class towhee.functional.mixins.data_processing.DataProcessingMixin

Bases: object

Mixin for processing data.

Methods

  • batch – Create batches from the DataCollection.

  • combine – Combine DataFrames so that schemas from separate DataFrame chains can be accessed together.

  • flatten – Flatten nested data within DataCollection.

  • group_by – Merge columns in DataCollection.

  • head – Return the first n values of a DataCollection.

  • rolling – Create rolling windows from DataCollection.

  • sample – Sample the data collection.

  • select_from – Select items from another DataCollection, using the values of this one as indices.

  • shuffle – Shuffle an unstreamed data collection in place.

  • zip – Combine multiple data collections.

batch(size, drop_tail=False)

Create batches from the DataCollection.

Parameters:
  • size (int) – Batch size.

  • drop_tail (bool) – Drop trailing window that is not full, defaults to False.

Returns:

Batched DataCollection.

Return type:

DataCollection

Examples

>>> from towhee import DataCollection
>>> dc = DataCollection(range(10))
>>> [list(batch) for batch in dc.batch(2)]
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
>>> dc = DataCollection(range(10))
>>> dc.batch(3)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
>>> dc = DataCollection(range(10))
>>> dc.batch(3, drop_tail=True)
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
>>> from towhee import Entity
>>> dc = DataCollection([Entity(a=a, b=b) for a,b in zip(['abc', 'vdfvcd', 'cdsc'], [1,2,3])])
>>> dc.batch(2)
[[<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>], [<Entity dict_keys(['a', 'b'])>]]
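
The batching behavior shown in the examples above can be approximated in plain Python. The following is a minimal sketch, not towhee's implementation; `batch_items` is a hypothetical helper name introduced only for illustration.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_items(items: Iterable[T], size: int, drop_tail: bool = False) -> Iterator[List[T]]:
    """Yield consecutive, non-overlapping batches of at most `size` items.

    Hypothetical helper: mirrors the documented examples, not towhee's code.
    """
    current: List[T] = []
    for item in items:
        current.append(item)
        if len(current) == size:
            yield current
            current = []
    # Emit the final partial batch unless the caller asked to drop it.
    if current and not drop_tail:
        yield current


print(list(batch_items(range(10), 3)))
print(list(batch_items(range(10), 3, drop_tail=True)))
```

With `drop_tail=True` the trailing `[9]` is omitted because it holds fewer than `size` items.
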

classmethod combine(*datacollections)

Combine DataFrames so that schemas from separate DataFrame chains can be accessed together.

Parameters:

datacollections (DataFrame) – DataFrames to combine.

Examples

>>> import towhee
>>> a = towhee.range['a'](1,5)
>>> b = towhee.range['b'](5,10)
>>> c = towhee.range['c'](10, 15)
>>> z = towhee.DataFrame.combine(a, b, c)
>>> z.as_raw().to_list()
[(1, 5, 10), (2, 6, 11), (3, 7, 12), (4, 8, 13)]
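
In the example above, `a` holds four values while `b` and `c` hold five, and the result has four rows. That is consistent with position-wise pairing that stops at the shortest input. A minimal plain-Python sketch of that behavior, assuming zip-like semantics (`combine_columns` is a hypothetical name, not part of towhee):

```python
def combine_columns(*columns):
    """Pair the i-th value of every column into one row tuple.

    Hypothetical helper: assumes zip-like semantics inferred from the
    documented example, not taken from towhee's source.
    """
    # zip stops at the shortest input, so surplus values are dropped.
    return list(zip(*columns))


rows = combine_columns(range(1, 5), range(5, 10), range(10, 15))
print(rows)
```
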

flatten(*args) -> DataCollection

Flatten nested data within DataCollection.

Returns:

Flattened DataCollection.

Return type:

DataCollection

Examples

>>> from towhee import DataCollection, Entity
>>> dc = DataCollection(range(10))
>>> nested_dc = dc.batch(2)
>>> nested_dc.flatten().to_list()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> g = (i for i in range(3))
>>> e = Entity(a=1, b=2, c=g)
>>> dc = DataCollection([e]).flatten('c')
>>> [str(i) for i in dc]
["{'a': 1, 'b': 2, 'c': 0}", "{'a': 1, 'b': 2, 'c': 1}", "{'a': 1, 'b': 2, 'c': 2}"]
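
The two flattening modes in the examples above (removing one level of nesting, and expanding a nested column into one row per value) can be sketched in plain Python. This is an illustration under assumptions: plain dicts stand in for `Entity`, and `flatten_once` and `flatten_column` are hypothetical names.

```python
from itertools import chain


def flatten_once(nested):
    """Remove one level of nesting from an iterable of iterables."""
    return list(chain.from_iterable(nested))


def flatten_column(rows, column):
    """Emit one copy of each row per value found in row[column].

    Rows are plain dicts here, standing in for towhee's Entity.
    """
    out = []
    for row in rows:
        for value in row[column]:
            # Copy the row and replace the nested column with a single value.
            out.append({**row, column: value})
    return out


print(flatten_once([[0, 1], [2, 3], [4, 5]]))
print(flatten_column([{"a": 1, "b": 2, "c": range(3)}], "c"))
```
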

group_by(index) -> DataCollection

Merge columns in DataCollection. Unstreamed data only.

Examples

>>> import towhee
>>> dc = towhee.dc['a']([1,1,2,2,3,3])
>>> [i.a for i in dc]
[1, 1, 2, 2, 3, 3]
>>> dc = dc.group_by('a')
>>> [i.a for i in dc]
[1, 2, 3]

head(n: int = 5)

Return the first n values of a DataCollection.

Parameters:

n (int, optional) – The amount to select, defaults to 5.

Returns:

DataCollection with the selected values.

Return type:

DataCollection
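
No example is given for `head`. As a rough plain-Python analogue, taking the first n values of any iterable can be written with `itertools.islice`. This is a sketch, not towhee's implementation; in particular, returning fewer than n values for a short input is a property of this sketch and an assumption about the library.

```python
from itertools import islice


def head(items, n=5):
    """Return the first n values of an iterable as a list.

    Hypothetical stand-in for illustration; not towhee's implementation.
    """
    # islice stops after n items and never reads past them.
    return list(islice(items, n))


print(head(range(10)))
print(head(range(3)))
```
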

rolling(size: int, step: int = 1, drop_head=True, drop_tail=True)

Create rolling windows from DataCollection.

Parameters:
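  • size (int) – Window size.

  • step (int) – Offset between the starts of consecutive windows, defaults to 1.

  • drop_head (bool) – Drop head windows that are not full, defaults to True.

  • drop_tail (bool) – Drop trailing windows that are not full, defaults to True.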

Returns:

DataCollection of rolling windows.

Return type:

DataCollection

Examples

>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4]]
>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3, drop_head=False)]
[[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3, drop_tail=False)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4], [4]]
>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> dc.rolling(2, 2, drop_head=False, drop_tail=False)
[[0], [0, 1], [2, 3], [4]]
>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> dc.rolling(2, 4, drop_head=False, drop_tail=False)
[[0], [0, 1], [4]]
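
The interaction of `size`, `step`, `drop_head`, and `drop_tail` is easiest to see in code. The sketch below reproduces the five documented outputs above in plain Python. It is inferred from those examples rather than taken from towhee's source, so behavior on other inputs is an assumption; `rolling_windows` is a hypothetical name.

```python
def rolling_windows(items, size, step=1, drop_head=True, drop_tail=True):
    """Build rolling windows over a materialized sequence.

    Hypothetical helper inferred from the documented examples.
    """
    data = list(items)
    windows = []
    if not drop_head:
        # Head windows: growing prefixes shorter than a full window.
        for length in range(1, min(size, len(data) + 1)):
            windows.append(data[:length])
    # Regular windows start every `step` positions.
    for start in range(0, len(data), step):
        window = data[start:start + size]
        # Windows cut short by the end of the data are tail windows.
        if len(window) == size or not drop_tail:
            windows.append(window)
    return windows


print(rolling_windows(range(5), 3))
print(rolling_windows(range(5), 2, 2, drop_head=False, drop_tail=False))
```
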

sample(ratio=1.0) -> DataCollection

Sample the data collection.

Parameters:

ratio (float) – Fraction of items to keep, defaults to 1.0.

Returns:

Sampled data collection.

Return type:

DataCollection

Examples

>>> from towhee import DataCollection
>>> dc = DataCollection(range(10000))
>>> result = dc.sample(0.1)
>>> ratio = len(result.to_list()) / 10000.
>>> 0.09 < ratio < 0.11
True
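
The example above only checks that the kept fraction is close to the requested ratio, which suggests that items are selected at random one by one rather than as an exact count. A minimal sketch of that idea, assuming independent per-item selection (`sample_items` and its `rng` parameter are hypothetical and not part of towhee):

```python
import random


def sample_items(items, ratio=1.0, rng=None):
    """Keep each item independently with probability `ratio`.

    Assumption: per-item random selection, inferred from the documented
    example; not taken from towhee's source.
    """
    rng = rng or random.Random()
    # random() returns a value in [0.0, 1.0), so ratio=1.0 keeps everything.
    return [item for item in items if rng.random() < ratio]


kept = sample_items(range(10000), 0.1, rng=random.Random(0))
print(len(kept) / 10000)
```

Under this scheme the result size varies from run to run, which is why the documented check uses a tolerance band instead of an exact length.
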

select_from(other)

Select items from other, using the values of this DataCollection as indices.

Parameters:

other (DataCollection) – DataCollection to select from.

Examples

>>> from towhee import DataCollection
>>> dc1 = DataCollection([0.8, 0.9, 8.1, 9.2])
>>> dc2 = DataCollection([[1, 2, 0], [2, 3, 0]])
>>> dc3 = dc2.select_from(dc1)
>>> list(dc3)
[[0.9, 8.1, 0.8], [8.1, 9.2, 0.8]]
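
In the example above, each element of `dc2` is a list of positions, and the result holds the values found at those positions in `dc1`. A plain-Python sketch of that index lookup (the function and argument names here are hypothetical, chosen for illustration):

```python
def select_by_index(index_rows, source):
    """Replace every index in each row with the value at that position in source.

    Hypothetical helper mirroring the documented example.
    """
    values = list(source)
    return [[values[i] for i in row] for row in index_rows]


print(select_by_index([[1, 2, 0], [2, 3, 0]], [0.8, 0.9, 8.1, 9.2]))
```
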

shuffle() -> DataCollection

Shuffle an unstreamed data collection in place.

Returns:

Shuffled data collection.

Return type:

DataCollection

Examples

1. Shuffle:

>>> from towhee import DataCollection
>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> a = dc.shuffle()
>>> tuple(a) == tuple(range(5))
False

2. Streamed data collection is not supported:

>>> dc = DataCollection([0, 1, 2, 3, 4]).stream()
>>> _ = dc.shuffle()
Traceback (most recent call last):
TypeError: shuffle is not supported for streamed data collection.

zip(*others) -> DataCollection

Combine multiple data collections.

Parameters:

*others (DataCollection) – The other data collections.

Returns:

Data collection with zipped values.

Return type:

DataCollection

Examples

>>> from towhee import DataCollection
>>> dc1 = DataCollection([1,2,3,4])
>>> dc2 = DataCollection([1,2,3,4]).map(lambda x: x+1)
>>> dc3 = dc1.zip(dc2)
>>> list(dc3)
[(1, 2), (2, 3), (3, 4), (4, 5)]