towhee.functional.mixins.data_processing.DataProcessingMixin
- class towhee.functional.mixins.data_processing.DataProcessingMixin
Bases: object
Mixin for processing data.
Methods
batch – Create batches from the DataCollection.
combine – Combine dataframes to be able to access schemas from separate DF chains.
flatten – Flatten nested data within DataCollection.
group_by – Merge columns in DataCollection.
head – Return the first n values of a DataCollection.
rolling – Create rolling windows from DataCollection.
sample – Sample the data collection.
select_from – Select data from another DataCollection, using the elements of self as indices.
shuffle – Shuffle an unstreamed data collection in place.
zip – Combine multiple data collections.
- batch(size, drop_tail=False)
Create batches from the DataCollection.
- Parameters:
size (int) – Batch size.
drop_tail (bool) – Drop the trailing batch if it is not full, defaults to False.
- Returns:
Batched DataCollection.
- Return type:
DataCollection
Examples
>>> from towhee import DataCollection
>>> dc = DataCollection(range(10))
>>> [list(batch) for batch in dc.batch(2)]
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

>>> dc = DataCollection(range(10))
>>> dc.batch(3)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

>>> dc = DataCollection(range(10))
>>> dc.batch(3, drop_tail=True)
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]

>>> from towhee import Entity
>>> dc = DataCollection([Entity(a=a, b=b) for a,b in zip(['abc', 'vdfvcd', 'cdsc'], [1,2,3])])
>>> dc.batch(2)
[[<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>], [<Entity dict_keys(['a', 'b'])>]]
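The batching behaviour shown in the examples above can be sketched in plain Python. This is an illustrative stand-in, not towhee's implementation; the free function `batch` is a hypothetical helper written only to mirror the documented `size` and `drop_tail` parameters.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch(items: Iterable[T], size: int, drop_tail: bool = False) -> Iterator[List[T]]:
    """Hypothetical helper: yield consecutive lists of up to `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))  # take the next `size` items, if any
        if not chunk:
            return  # input exhausted
        if len(chunk) < size and drop_tail:
            return  # discard the incomplete trailing batch
        yield chunk


full = list(batch(range(10), 3))
trimmed = list(batch(range(10), 3, drop_tail=True))
print(full)
print(trimmed)
```

Because the sketch pulls items lazily through `islice`, it also works on generators whose length is not known in advance.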
- classmethod combine(*datacollections)
Combine dataframes to be able to access schemas from separate DF chains.
- Parameters:
datacollections (DataFrame) – DataFrames to combine.
Examples
>>> import towhee
>>> a = towhee.range['a'](1,5)
>>> b = towhee.range['b'](5,10)
>>> c = towhee.range['c'](10, 15)
>>> z = towhee.DataFrame.combine(a, b, c)
>>> z.as_raw().to_list()
[(1, 5, 10), (2, 6, 11), (3, 7, 12), (4, 8, 13)]
- flatten(*args) → DataCollection
Flatten nested data within DataCollection.
- Returns:
Flattened DataCollection.
- Return type:
DataCollection
Examples
>>> from towhee import DataCollection, Entity
>>> dc = DataCollection(range(10))
>>> nested_dc = dc.batch(2)
>>> nested_dc.flatten().to_list()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> g = (i for i in range(3))
>>> e = Entity(a=1, b=2, c=g)
>>> dc = DataCollection([e]).flatten('c')
>>> [str(i) for i in dc]
["{'a': 1, 'b': 2, 'c': 0}", "{'a': 1, 'b': 2, 'c': 1}", "{'a': 1, 'b': 2, 'c': 2}"]
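Both forms of flattening in the examples above can be approximated with the standard library. The sketch below uses plain lists and dicts in place of DataCollection and Entity; `flatten_field` is a hypothetical helper, and how towhee implements this internally is not shown in this reference.

```python
from itertools import chain

# Case 1: one level of nesting removed, as with nested_dc.flatten().
nested = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
flat = list(chain.from_iterable(nested))


# Case 2: expand one field of a record, as with flatten('c').
def flatten_field(record: dict, field: str):
    """Hypothetical helper: yield one copy of `record` per value in record[field]."""
    for value in record[field]:
        yield {**record, field: value}  # other fields are repeated unchanged


rows = list(flatten_field({"a": 1, "b": 2, "c": (i for i in range(3))}, "c"))
print(flat)
print(rows)
```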
- group_by(index) → DataCollection
Merge columns in DataCollection. Unstreamed data only.
Examples
>>> import towhee
>>> dc = towhee.dc['a']([1,1,2,2,3,3])
>>> [i.a for i in dc]
[1, 1, 2, 2, 3, 3]

>>> dc = dc.group_by('a')
>>> [i.a for i in dc]
[1, 2, 3]
- head(n: int = 5)
Return the first n values of a DataCollection.
- Parameters:
n (int, optional) – Number of values to select, defaults to 5.
- Returns:
DataCollection with the selected values.
- Return type:
DataCollection
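No example is given for head. As a rough illustration of taking the first n values, here is a plain-Python sketch built on `itertools.islice`. The free function `head` is a hypothetical helper, and returning fewer than n values for a short input is an assumption, not documented towhee behaviour.

```python
from itertools import islice


def head(items, n=5):
    """Hypothetical helper: return the first `n` values of any iterable."""
    # islice stops after n items, so a long or lazy input is not fully consumed.
    return list(islice(items, n))


first_five = head(range(10))
first_two = head(range(10), 2)
print(first_five)
print(first_two)
```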
- rolling(size: int, step: int = 1, drop_head=True, drop_tail=True)
Create rolling windows from DataCollection.
- Parameters:
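size (int) – Window size.
step (int) – Offset between the starts of consecutive windows, defaults to 1.
drop_head (bool) – Drop head windows that are not full, defaults to True.
drop_tail (bool) – Drop trailing windows that are not full, defaults to True.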
- Returns:
DataCollection of rolling windows.
- Return type:
DataCollection
Examples
>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4]]

>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3, drop_head=False)]
[[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]

>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3, drop_tail=False)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4], [4]]

>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> dc.rolling(2, 2, drop_head=False, drop_tail=False)
[[0], [0, 1], [2, 3], [4]]

>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> dc.rolling(2, 4, drop_head=False, drop_tail=False)
[[0], [0, 1], [4]]
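The interaction of `size`, `step`, `drop_head`, and `drop_tail` is easier to follow in code. The generator below is a plain-Python sketch written to reproduce the five documented outputs above; it is a hypothetical helper, not necessarily how towhee implements rolling, and the argument validation is an addition of this sketch.

```python
def rolling(items, size, step=1, drop_head=True, drop_tail=True):
    """Hypothetical helper: yield rolling windows over `items` as lists."""
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")  # guard added by this sketch
    buff = []
    gap = 0          # items still to skip when step > size
    in_head = True   # True until the first full window has been emitted
    for item in items:
        if gap:
            gap -= 1
            continue
        buff.append(item)
        if len(buff) == size:
            yield list(buff)
            in_head = False
            buff = buff[step:]          # slide the window forward by `step`
            gap = max(step - size, 0)
        elif in_head and not drop_head:
            yield list(buff)            # partial window while the first one fills
    while buff and not drop_tail:
        yield list(buff)                # partial windows left over at the end
        buff = buff[step:]


windows = list(rolling(range(5), 3))
print(windows)
```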
- sample(ratio=1.0) → DataCollection
Sample the data collection.
- Parameters:
ratio (float) – Approximate fraction of items to keep, defaults to 1.0.
- Returns:
Sampled data collection.
- Return type:
DataCollection
Examples
>>> from towhee import DataCollection
>>> dc = DataCollection(range(10000))
>>> result = dc.sample(0.1)
>>> ratio = len(result.to_list()) / 10000.
>>> 0.09 < ratio < 0.11
True
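The example above checks only that the kept fraction is close to `ratio`, which is what independent per-item sampling would produce. The sketch below assumes that strategy; both the helper `sample` and its `seed` parameter are hypothetical and are not part of the documented API.

```python
import random


def sample(items, ratio=1.0, seed=None):
    """Hypothetical helper: keep each item independently with probability `ratio`."""
    rng = random.Random(seed)  # private generator; `seed` is an addition of this sketch
    return [x for x in items if rng.random() < ratio]


kept = sample(range(10000), 0.1, seed=0)
fraction = len(kept) / 10000
print(fraction)
```

With this approach the result size varies from run to run, which is why the documented check uses a tolerance band and not an exact count.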
- select_from(other)
Select data from other, using the elements of self as indices.
- Parameters:
other (DataCollection) – DataCollection to select from.
Examples
>>> from towhee import DataCollection
>>> dc1 = DataCollection([0.8, 0.9, 8.1, 9.2])
>>> dc2 = DataCollection([[1, 2, 0], [2, 3, 0]])
>>> dc3 = dc2.select_from(dc1)
>>> list(dc3)
[[0.9, 8.1, 0.8], [8.1, 9.2, 0.8]]
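In the example above, each element of `dc2` is a list of positions and the result holds the values of `dc1` at those positions. A plain-Python sketch of that lookup follows; the free function `select_from` and its argument names are hypothetical.

```python
def select_from(index_lists, source):
    """Hypothetical helper: replace each index with the value at that position in `source`."""
    source = list(source)  # materialize so positions can be looked up repeatedly
    return [[source[i] for i in indices] for indices in index_lists]


selected = select_from([[1, 2, 0], [2, 3, 0]], [0.8, 0.9, 8.1, 9.2])
print(selected)
```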
- shuffle() → DataCollection
Shuffle an unstreamed data collection in place.
- Returns:
Shuffled data collection.
- Return type:
DataCollection
Examples
1. Shuffle:
>>> from towhee import DataCollection
>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> a = dc.shuffle()
>>> tuple(a) == tuple(range(5))
False

2. Streamed data collection is not supported:
>>> dc = DataCollection([0, 1, 2, 3, 4]).stream()
>>> _ = dc.shuffle()
Traceback (most recent call last):
TypeError: shuffle is not supported for streamed data collection.
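The same restriction exists in the standard library, which may help explain it: an in-place shuffle needs random access to every element, and a stream yields items only once, in order. The sketch below uses `random.shuffle` on a list and on a generator; it illustrates the general constraint and makes no claim about towhee's internals.

```python
import random

# A list supports in-place shuffling: the same elements, reordered.
values = [0, 1, 2, 3, 4]
random.shuffle(values)

# A generator has no length and no item assignment, so shuffling it fails.
stream = (i for i in range(5))
try:
    random.shuffle(stream)
    stream_shuffled = True
except TypeError:
    stream_shuffled = False

print(sorted(values), stream_shuffled)
```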
- zip(*others) → DataCollection
Combine multiple data collections.
- Parameters:
*others (DataCollection) – The other data collections.
- Returns:
Data collection with zipped values.
- Return type:
DataCollection
Examples
>>> from towhee import DataCollection
>>> dc1 = DataCollection([1,2,3,4])
>>> dc2 = DataCollection([1,2,3,4]).map(lambda x: x+1)
>>> dc3 = dc1.zip(dc2)
>>> list(dc3)
[(1, 2), (2, 3), (3, 4), (4, 5)]
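The documented output matches what Python's built-in `zip` produces for the same inputs, so the built-in serves as a rough mental model. The sketch below uses plain lists; whether DataCollection.zip also truncates to the shortest input, as the built-in does, is not stated in this reference.

```python
left = [1, 2, 3, 4]
right = map(lambda x: x + 1, left)  # lazily computed, like the mapped dc2 above

pairs = list(zip(left, right))
print(pairs)

# The built-in stops at the shortest input; this is Python's behaviour,
# not a documented guarantee of DataCollection.zip.
short = list(zip([1, 2, 3], ["a"]))
print(short)
```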