DataCollection
- class towhee.DataCollection(iterable: Iterable)
Bases: Iterable, DCMixins
A pythonic computation and processing framework.
DataCollection is a pythonic computation and processing framework for unstructured data in machine learning and data science. It lets a data scientist or researcher assemble data processing pipelines and do their model work (embedding, transforming, or classification) with a method-chaining style API. It is also designed to behave like a Python list or iterator. When created from a list, each operation is performed only once all data from the previous step has been stored. When created from an iterator, operations are performed stream-wise: elements are read and processed one by one, and the pipeline only progresses when its previous output has been consumed.
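The difference between list-backed and iterator-backed execution can be illustrated without towhee at all. The following is a minimal sketch in plain Python, not towhee's implementation: the `traced` helper and the `log` list are hypothetical names introduced only to record the order in which the steps run.

```python
from typing import Callable, List

log: List[str] = []  # hypothetical call log, used only for illustration


def traced(name: str, fn: Callable[[int], int]) -> Callable[[int], int]:
    """Wrap fn so that every call is recorded in `log`."""
    def wrapper(x: int) -> int:
        log.append(f"{name}({x})")
        return fn(x)
    return wrapper


add1 = traced("add1", lambda x: x + 1)
mul2 = traced("mul2", lambda x: x * 2)

# List-style (eager): the first step finishes over all data before the next starts.
log.clear()
step1 = [add1(x) for x in [0, 1]]
eager_result = [mul2(x) for x in step1]
eager_order = list(log)

# Iterator-style (streamed): each element passes through both steps in turn.
log.clear()
stream = map(mul2, map(add1, iter([0, 1])))
nothing_ran_yet = (log == [])  # lazy: no work happens until the stream is consumed
stream_result = list(stream)
stream_order = list(log)
```

Both variants produce the same values; only the interleaving of calls differs, which matters when the data does not fit in memory or a step has side effects.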
Examples
Create a DataCollection from list or iterator:
>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> dc = DataCollection(iter([0, 1, 2, 3, 4]))
Chaining function invocations makes your code clean and fluent:
>>> (
...     dc.map(lambda x: x+1)
...       .map(lambda x: x*2)
... ).to_list()
[2, 4, 6, 8, 10]
Multi-line closures are also supported via decorator syntax:
>>> dc = DataCollection([1,2,3,4])
>>> @dc.map
... def add1(x):
...     return x+1
>>> @add1.map
... def mul2(x):
...     return x*2
>>> @mul2.filter
... def ge3(x):
...     return x>=7
>>> ge3.to_list()
[8, 10]
- __init__(iterable: Iterable) -> None
Initializes a new DataCollection instance.
- Parameters:
iterable (Iterable) – The iterable data that is stored in the DataCollection.
- __iter__() -> iter
Generate an iterator of the DataCollection.
- Returns:
iterator for the data.
- Return type:
iter
- __getattr__(name) -> DataCollection
Unknown method dispatcher.
When an unknown method is invoked on a DataCollection object, the call is dispatched to a method resolver. By registering a function with the resolver, you can extend DataCollection’s API at runtime without modifying its code.
- Parameters:
name (str) – The unknown attribute.
- Returns:
A new DataCollection for the output of the attribute call.
- Return type:
DataCollection
Examples
>>> from towhee import register
>>> dc = DataCollection([1,2,3,4])
>>> @register(name='test/add1')
... def add1(x):
...     return x+1
>>> dc.test.add1().to_list()
[2, 3, 4, 5]
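The dispatch idea can be sketched with a small stand-in class. This is an illustration under stated assumptions, not towhee's resolver: `MiniCollection`, `_REGISTRY`, and this local `register` are hypothetical, and the sketch uses flat names rather than the namespaced `test/add1` form shown above.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical registry mapping a method name to a per-element function.
_REGISTRY: Dict[str, Callable[[Any], Any]] = {}


def register(name: str) -> Callable:
    """Hypothetical decorator that stores a function under `name`."""
    def decorator(fn: Callable[[Any], Any]) -> Callable[[Any], Any]:
        _REGISTRY[name] = fn
        return fn
    return decorator


class MiniCollection:
    """Stand-in collection that resolves unknown methods via the registry."""

    def __init__(self, data: Iterable):
        self._data = list(data)

    def __getattr__(self, name: str):
        # Python calls __getattr__ only when normal attribute lookup fails.
        if name.startswith("_") or name not in _REGISTRY:
            raise AttributeError(name)
        fn = _REGISTRY[name]

        def call() -> "MiniCollection":
            # Apply the registered function to every element.
            return MiniCollection(fn(x) for x in self._data)
        return call

    def to_list(self) -> List:
        return list(self._data)


@register(name="add1")
def add1(x):
    return x + 1


result = MiniCollection([1, 2, 3, 4]).add1().to_list()
```

Because `__getattr__` is consulted only after regular lookup fails, methods defined on the class itself (such as `to_list`) always take precedence over registered ones.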
- __getitem__(index) -> any
Index based access of element in DataCollection.
Access the element at the given index, similar to accessing list[at_index]. Does not work with streamed DataCollections.
- Parameters:
index (int) – The index location of the element being accessed.
- Raises:
TypeError – If called on a streamed DataCollection.
- Returns:
The object at index.
- Return type:
any
Examples
Usage with non-streamed:
>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> dc[2]
2
Usage with streamed:
>>> dc.stream()[1]
Traceback (most recent call last):
TypeError: indexing is only supported for DataCollection created from list or pandas DataFrame.
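The restriction mirrors plain Python, where iterators are not subscriptable. The sketch below uses only built-ins and `itertools.islice`; it makes no claim about how towhee checks for streamed data internally.

```python
from itertools import islice

data = [0, 1, 2, 3, 4]
by_index = data[2]  # lists support random access

stream = iter(data)
try:
    stream[1]  # iterators have no __getitem__
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Reaching the element at position 2 of a stream means consuming what precedes it.
third = next(islice(iter(data), 2, None))
```

Reading a position from a stream consumes the elements before it, which is one reason index access is a poor fit for streamed data.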
- __setitem__(index, value)
Index based setting of element in DataCollection.
Assign the value of the element at the given index, similar to list[at_index]=val. Does not work with streamed DataCollections.
- Parameters:
index (int) – The index location of the element being set.
value (any) – The value to be set.
- Raises:
TypeError – If called on a streamed DataCollection.
Examples
Usage with non-streamed:
>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> dc[2] = 3
>>> dc.to_list()
[0, 1, 3, 3, 4]
Usage with streamed:
>>> dc.stream()[1]
Traceback (most recent call last):
TypeError: indexing is only supported for DataCollection created from list or pandas DataFrame.
- __add__(other) -> DataCollection
Concatenate two DataCollections.
- Parameters:
other (DataCollection) – The DataCollection being appended to the calling DataCollection.
- Returns:
A new DataCollection containing the concatenated DataCollections.
- Return type:
DataCollection
Examples
>>> dc0 = DataCollection.range(5)
>>> dc1 = DataCollection.range(5)
>>> dc2 = DataCollection.range(5)
>>> (dc0 + dc1 + dc2)
[0, 1, 2, 3, 4, 0, ...]
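In plain Python, the same concatenation can be written eagerly with list addition or lazily with `itertools.chain`. This is a sketch of the general technique only; whether towhee uses `chain` internally is not stated in this reference.

```python
from itertools import chain

a, b, c = range(5), range(5), range(5)

# Eager concatenation: builds the whole result in memory.
eager = list(a) + list(b) + list(c)

# Lazy concatenation: yields from each source in turn, on demand.
lazy = chain(a, b, c)
first_six = [next(lazy) for _ in range(6)]
```

The lazy form never materialises the combined sequence, so it also works when the inputs are long or unbounded streams.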
- __repr__() -> str
String representation of the DataCollection.
- Returns:
String representation of the DataCollection.
- Return type:
str
Examples
Usage with non-streamed:
>>> DataCollection([1, 2, 3]).unstream()
[1, 2, 3]
Usage with streamed:
>>> DataCollection([1, 2, 3]).stream()
<list_iterator object at...>
- static range(*arg, **kws) -> DataCollection
Generate a DataCollection with a range of values.
Generate a DataCollection with a range of numbers as the data. Works the same way as Python's built-in range() function.
- Returns:
A new DataCollection.
- Return type:
DataCollection
Examples
>>> DataCollection.range(5).to_list()
[0, 1, 2, 3, 4]
- to_list() -> list
Convert DataCollection to list.
- Returns:
List of values stored in DataCollection.
- Return type:
list
Examples
>>> DataCollection.range(5).to_list()
[0, 1, 2, 3, 4]
- map(*arg) -> DataCollection
Apply a function across all values in a DataCollection.
Multiple functions can be applied to the DataCollection. If multiple functions are supplied, the same number of new DataCollections will be returned.
- Parameters:
*arg (Callable) – One or multiple functions to apply to the DataCollection.
- Returns:
New DataCollection containing computation results.
- Return type:
DataCollection
Examples
Single Function:
>>> dc = DataCollection([1,2,3,4])
>>> dc.map(lambda x: x+1).map(lambda x: x*2).to_list()
[4, 6, 8, 10]
Multiple Functions:
>>> dc = DataCollection([1,2,3,4])
>>> a, b = dc.map(lambda x: x+1, lambda x: x*2)
>>> (a.to_list(), b.to_list())
([2, 3, 4, 5], [2, 4, 6, 8])
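One way to get several outputs from a single streamed source is to give each function its own branch of the input. The sketch below does this with `itertools.tee`; `map_many` is a hypothetical helper, and nothing here is taken from towhee's source.

```python
from itertools import tee
from typing import Callable, Iterable, Iterator, Tuple


def map_many(data: Iterable, *funcs: Callable) -> Tuple[Iterator, ...]:
    """Return one lazily mapped iterator per function (hypothetical helper)."""
    # tee gives every function an independent view of the same source.
    branches = tee(data, len(funcs))
    return tuple(map(fn, branch) for fn, branch in zip(funcs, branches))


a, b = map_many(iter([1, 2, 3, 4]), lambda x: x + 1, lambda x: x * 2)
a_list, b_list = list(a), list(b)
```

Note that `tee` buffers items that one branch has consumed and another has not, so draining one output fully before touching the other holds the entire input in memory.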
- filter(unary_op: Callable, drop_empty=False) -> DataCollection
Filter the DataCollection's data with a function.
Filters the DataCollection based on the function provided. If the data is stored as an Option (see towhee.functional.option.py), drop_empty decides whether a rejected element is removed or set to empty.
- Parameters:
unary_op (Callable) – Function that dictates filtering.
drop_empty (bool, optional) – Whether to drop empty fields. Defaults to False.
- Returns:
Resulting DataCollection after filter.
- Return type:
DataCollection
- run()
Iterate through the DataCollection's data.
A stream-based DataCollection does not execute until its data is consumed by a data sink. This function acts as a data sink that consumes the data without performing any further operations.
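The role of a data sink can be shown with a plain lazy pipeline: nothing executes until something consumes it. The sketch below is plain Python rather than towhee code; this `run` function, `record`, and `seen` are hypothetical names used only for the illustration.

```python
from collections import deque
from typing import Iterable, List

seen: List[int] = []  # records which elements were actually processed


def record(x: int) -> int:
    seen.append(x)
    return x


def run(stream: Iterable) -> None:
    """Consume the stream and discard the results (a sink)."""
    deque(stream, maxlen=0)  # maxlen=0 keeps nothing, so memory use stays constant


pipeline = map(record, iter([1, 2, 3]))
before = list(seen)  # the pipeline is lazy, so nothing has been processed yet
run(pipeline)
after = list(seen)
```

A sink of this kind is useful when the pipeline is run for its side effects, such as writing to disk or a database, and the final values are not needed.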
- to_df() -> DataFrame
Turn a DataCollection into a DataFrame.
- Returns:
Resulting converted DataFrame.
- Return type:
DataFrame
Examples
>>> from towhee import DataCollection, Entity
>>> e = [Entity(a=a, b=b) for a,b in zip(['abc', 'def', 'ghi'], [1,2,3])]
>>> dc = DataCollection(e)
>>> type(dc)
<class 'towhee.functional.data_collection.DataCollection'>
>>> type(dc.to_df())
<class 'towhee.functional.data_collection.DataFrame'>
- __weakref__
list of weak references to the object (if defined)