towhee.functional.mixins.data_processing.DataProcessingMixin
- class towhee.functional.mixins.data_processing.DataProcessingMixin
Bases: object
Mixin for processing data.
Methods
batch – Create small batches from data collections.
flatten – Flatten nested data collections.
head – Get the first n lines of a DataCollection.
rolling – Create rolling windows from data collections.
sample – Sample the data collection.
select_from – Select data from dc with list(self).
shuffle – Shuffle an unstreamed data collection in place.
zip – Combine two data collections.
- batch(size, drop_tail=False, raw=True)
Create small batches from data collections.
- Parameters:
size (int) – Batch size, i.e. the number of items in each batch.
drop_tail (bool) – Whether to drop the trailing batch if it is not full, defaults to False.
raw (bool) – Whether to return raw data instead of a DataCollection, defaults to True.
- Returns:
A DataCollection of batches, or the batches as raw data when raw is True.
Examples:
>>> from towhee import DataCollection
>>> dc = DataCollection(range(10))
>>> [list(batch) for batch in dc.batch(2, raw=False)]
[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

>>> dc = DataCollection(range(10))
>>> dc.batch(3)
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

>>> dc = DataCollection(range(10))
>>> dc.batch(3, drop_tail=True)
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]

>>> from towhee import Entity
>>> dc = DataCollection([Entity(a=a, b=b) for a,b in zip(['abc', 'vdfvcd', 'cdsc'], [1,2,3])])
>>> dc.batch(2)
[<Entity dict_keys(['a', 'b'])>, <Entity dict_keys(['a', 'b'])>]
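The effect of size and drop_tail on plain values can be illustrated without towhee. The following is a minimal pure-Python sketch of the batching behavior shown in the integer examples above; the helper name batch_sketch is hypothetical and is not part of the towhee API, and the sketch does not cover the Entity case.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_sketch(items: Iterable[T], size: int, drop_tail: bool = False) -> Iterator[List[T]]:
    """Yield consecutive, non-overlapping lists of up to `size` items.

    Hypothetical helper for illustration only, not towhee's implementation.
    """
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) == size:
            yield buffer
            buffer = []
    # Anything left over is a final batch shorter than `size`.
    if buffer and not drop_tail:
        yield buffer


print(list(batch_sketch(range(10), 3)))                  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(list(batch_sketch(range(10), 3, drop_tail=True)))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

With drop_tail=True the short final batch [9] is discarded, which is useful when every batch must have exactly the same length.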
- flatten() → DataCollection
Flatten nested data collections.
- Returns:
The flattened data collection.
- Return type:
DataCollection
Examples:
>>> from towhee import DataCollection
>>> dc = DataCollection(range(10))
>>> nested_dc = dc.batch(2)
>>> nested_dc.flatten().to_list()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
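For comparison, the same result can be reproduced on plain nested lists with the standard library. This is only an analogy for the example above, assuming a single level of nesting; it says nothing about how flatten is implemented in towhee or how it treats deeper nesting.

```python
from itertools import chain

# The nested structure produced by batching range(10) in pairs.
nested = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# chain.from_iterable concatenates the inner lists, removing one level of nesting.
flat = list(chain.from_iterable(nested))
print(flat)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```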
- head(n: int = 5)
Get the first n lines of a DataCollection.
- Parameters:
n (int) – The number of lines to return, defaults to 5.
Examples:
>>> from towhee import DataCollection
>>> DataCollection.range(10).head(3).to_list()
[0, 1, 2]
- rolling(size: int, drop_head=True, drop_tail=True)
Create rolling windows from data collections.
- Parameters:
size (int) – Window size.
drop_head (bool) – Whether to drop leading windows that are not full.
drop_tail (bool) – Whether to drop trailing windows that are not full.
- Returns:
A data collection of rolling windows.
- Return type:
DataCollection
Examples:
>>> from towhee import DataCollection
>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4]]

>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3, drop_head=False)]
[[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]

>>> dc = DataCollection(range(5))
>>> [list(batch) for batch in dc.rolling(3, drop_tail=False)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4], [4]]
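The three rolling examples above differ only in which partial windows are kept. A minimal pure-Python sketch that reproduces those outputs is given below; rolling_sketch is a hypothetical helper written for illustration, not towhee's implementation, and its behavior for inputs shorter than size is an assumption.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def rolling_sketch(
    items: Iterable[T], size: int, drop_head: bool = True, drop_tail: bool = True
) -> Iterator[List[T]]:
    """Yield overlapping windows of up to `size` items (illustration only)."""
    window: deque = deque(maxlen=size)
    for item in items:
        window.append(item)  # the oldest item falls out once the window is full
        # Leading windows are shorter than `size`; emit them only if requested.
        if len(window) == size or not drop_head:
            yield list(window)
    if not drop_tail:
        # Shrink the last window from the left to produce the trailing partial windows.
        while len(window) > 1:
            window.popleft()
            yield list(window)


print(list(rolling_sketch(range(5), 3)))                   # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(list(rolling_sketch(range(5), 3, drop_head=False)))  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(list(rolling_sketch(range(5), 3, drop_tail=False)))  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4], [4]]
```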
- sample(ratio=1.0) → DataCollection
Sample the data collection.
- Parameters:
ratio (float) – Sampling ratio, i.e. the approximate fraction of items to keep, defaults to 1.0.
- Returns:
The sampled data collection.
- Return type:
DataCollection
Examples:
>>> from towhee import DataCollection
>>> dc = DataCollection(range(10000))
>>> result = dc.sample(0.1)
>>> ratio = len(result.to_list()) / 10000.
>>> 0.09 < ratio < 0.11
True
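The example above checks the sampled fraction only approximately, which is what one would expect if each item is kept independently with probability ratio. The sketch below illustrates that idea in plain Python. It is an assumption about the sampling scheme, not a statement of how towhee implements sample, and both the sample_sketch helper and its seed parameter are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_sketch(items: Iterable[T], ratio: float = 1.0, seed: Optional[int] = None) -> List[T]:
    """Keep each item independently with probability `ratio` (illustration only)."""
    rng = random.Random(seed)
    # rng.random() is in [0.0, 1.0), so ratio=1.0 keeps everything and ratio=0.0 keeps nothing.
    return [item for item in items if rng.random() < ratio]


result = sample_sketch(range(10000), 0.1, seed=0)
# The kept fraction is close to 0.1 but not exactly 0.1.
print(len(result) / 10000)
```

Under this scheme the size of the result varies from run to run, so tests should assert a range rather than an exact count.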
- select_from(other)
Select data from other, using the values in list(self) as indices.
Examples:
>>> from towhee import DataCollection
>>> dc1 = DataCollection([0.8, 0.9, 8.1, 9.2])
>>> dc2 = DataCollection([[1, 2, 0], [2, 3, 0]])
>>> dc3 = dc2.select_from(dc1)
>>> list(dc3)
[[0.9, 8.1, 0.8], [8.1, 9.2, 0.8]]
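In the example above, each row of dc2 acts as a list of indices into dc1. The same lookup can be sketched in plain Python as follows; select_from_sketch is a hypothetical helper that mirrors only the behavior shown in the example.

```python
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def select_from_sketch(index_rows: Iterable[Sequence[int]], data: Sequence[T]) -> List[List[T]]:
    """Replace every index in every row with the item at that position in `data`."""
    return [[data[i] for i in row] for row in index_rows]


data = [0.8, 0.9, 8.1, 9.2]
index_rows = [[1, 2, 0], [2, 3, 0]]
print(select_from_sketch(index_rows, data))  # [[0.9, 8.1, 0.8], [8.1, 9.2, 0.8]]
```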
- shuffle() → DataCollection
Shuffle an unstreamed data collection in place.
- Returns:
The shuffled data collection.
- Return type:
DataCollection
Examples:
Shuffle:
>>> from towhee import DataCollection
>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> a = dc.shuffle()
>>> tuple(a) == tuple(range(5))
False

Streamed data collections are not supported:
>>> dc = DataCollection([0, 1, 2, 3, 4]).stream()
>>> _ = dc.shuffle()
Traceback (most recent call last):
TypeError: shuffle is not supported for streamed data collection.
- zip(*others) → DataCollection
Combine two data collections.
- Parameters:
*others (DataCollection) – The other data collections to combine with this one.
- Returns:
A data collection with zipped values.
- Return type:
DataCollection
Examples:
>>> from towhee import DataCollection
>>> dc1 = DataCollection([1,2,3,4])
>>> dc2 = DataCollection([1,2,3,4]).map(lambda x: x+1)
>>> dc3 = dc1.zip(dc2)
>>> list(dc3)
[(1, 2), (2, 3), (3, 4), (4, 5)]