DataCollection

class towhee.DataCollection(iterable: Iterable)

Bases: Iterable, DCMixins

A pythonic computation and processing framework.

DataCollection is a pythonic computation and processing framework for unstructured data in machine learning and data science. It lets a data scientist or researcher assemble data processing pipelines and do their model work (embedding, transforming, or classification) with a method-chaining style API. It is also designed to behave like a Python list or iterator. When created from a list, each operation is performed only once all data from the previous step has been stored. When created from an iterator, operations are performed stream-wise, reading and operating on items one by one and only progressing once the previous output has been consumed.
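The difference between list-like and stream-like evaluation can be illustrated with plain Python, without towhee. This is only a sketch of the general idea, not towhee's implementation; the names `traced`, `log`, `eager_order`, and `lazy_order` are hypothetical and exist only for this illustration.

```python
log = []  # records the order in which the two steps are called


def traced(name, fn):
    """Wrap fn so that every call is appended to the log."""
    def wrapper(x):
        log.append((name, x))
        return fn(x)
    return wrapper


add1 = traced("add1", lambda x: x + 1)
mul2 = traced("mul2", lambda x: x * 2)

# List-like behaviour: the first step finishes for all items
# before the second step starts.
step1 = [add1(x) for x in [0, 1, 2]]
eager_result = [mul2(x) for x in step1]
eager_order = [name for name, _ in log]

log.clear()

# Iterator-like behaviour: each item flows through both steps
# before the next item is read.
stream = map(mul2, map(add1, iter([0, 1, 2])))
lazy_result = list(stream)
lazy_order = [name for name, _ in log]
```

Both variants produce the same values; only the order of the calls differs (all `add1` calls first in the list case, alternating `add1` and `mul2` in the iterator case).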

Examples

Create a DataCollection from a list or an iterator:

>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> dc = DataCollection(iter([0, 1, 2, 3, 4]))

Chaining function invocations makes your code clean and fluent:

>>> (
...    dc.map(lambda x: x+1)
...      .map(lambda x: x*2)
... ).to_list()
[2, 4, 6, 8, 10]

Multi-line closures are also supported via decorator syntax:

>>> dc = DataCollection([1,2,3,4])
>>> @dc.map
... def add1(x):
...     return x+1
>>> @add1.map
... def mul2(x):
...     return x*2
>>> @mul2.filter
... def ge7(x):
...     return x>=7
>>> ge7.to_list()
[8, 10]
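The decorator form works because a method such as `map` takes a function and returns a new collection, so the decorated name ends up bound to that collection rather than to the function. The sketch below shows this with a toy class; `MiniCollection` is a hypothetical stand-in written for this illustration and is not towhee's DataCollection.

```python
class MiniCollection:
    """Toy list-backed collection with chainable map/filter (illustration only)."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn):
        # Returns a new collection, which is what makes @dc.map usable.
        return MiniCollection(fn(x) for x in self._items)

    def filter(self, pred):
        return MiniCollection(x for x in self._items if pred(x))

    def to_list(self):
        return list(self._items)


dc = MiniCollection([1, 2, 3, 4])


@dc.map
def add1(x):  # after decoration, add1 is a MiniCollection, not a function
    return x + 1


@add1.filter
def ge4(x):
    return x >= 4


result = ge4.to_list()
```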

__init__(iterable: Iterable) → None

Initializes a new DataCollection instance.

Parameters: iterable (Iterable) – The iterable data that is stored in the DataCollection.

__iter__() → iter

Return an iterator over the DataCollection.

Returns: Iterator for the data.
Return type: iter

__getattr__(name) → DataCollection

Unknown method dispatcher.

When an unknown method is invoked on a DataCollection object, the call is dispatched to a method resolver. By registering a function with the resolver, you can extend DataCollection’s API at runtime without modifying its code.

Parameters:

name (str) – The unknown attribute.

Returns:

A new DataCollection holding the output of the attribute call.

Return type:

DataCollection

Examples

>>> from towhee import register
>>> dc = DataCollection([1,2,3,4])
>>> @register(name='test/add1')
... def add1(x):
...     return x+1
>>> dc.test.add1().to_list()
[2, 3, 4, 5]
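The dispatch mechanism can be sketched with a registry dictionary and Python's `__getattr__` hook, which is only called when normal attribute lookup fails. This is a simplified illustration under stated assumptions, not towhee's resolver: `_REGISTRY`, `register_op`, and `MiniCollection` are hypothetical names, and the sketch uses flat names rather than the namespaced `test/add1` form shown above.

```python
_REGISTRY = {}  # hypothetical table mapping operator names to functions


def register_op(name):
    """Decorator that stores a function in the registry under the given name."""
    def decorator(fn):
        _REGISTRY[name] = fn
        return fn
    return decorator


class MiniCollection:
    """Toy collection that resolves unknown methods through the registry."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn):
        return MiniCollection(fn(x) for x in self._items)

    def to_list(self):
        return list(self._items)

    def __getattr__(self, name):
        # Reached only when the attribute is not found the normal way.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            fn = _REGISTRY[name]
        except KeyError:
            raise AttributeError(name) from None
        # Return a callable so that dc.add1() maps the registered function.
        return lambda: self.map(fn)


@register_op("add1")
def add1(x):
    return x + 1


result = MiniCollection([1, 2, 3, 4]).add1().to_list()
```

Names that are not in the registry still raise `AttributeError`, so ordinary misspellings are not silently accepted.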

__getitem__(index) → any

Index-based access to an element in a DataCollection.

Access the element at the given index, similar to accessing list[at_index]. Does not work with streamed DataCollections.

Parameters: index (int) – The index of the element being accessed.
Raises: TypeError – If called on a streamed DataCollection.
Returns: The object at index.
Return type: any

Examples

Usage with non-streamed:

>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> dc[2]
2

Usage with streamed:

>>> dc.stream()[1] 
Traceback (most recent call last):
TypeError: indexing is only supported for DataCollection created from list
    or pandas DataFrame.
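The restriction follows from how Python iterators work: a list supports random access, while an iterator only supports stepping forward. The sketch below shows one way such a guard could look; `get_item` and its error message are hypothetical, and whether towhee performs this exact check is an assumption.

```python
def get_item(data, index):
    """Return data[index], rejecting objects that cannot be indexed."""
    if not hasattr(data, "__getitem__"):
        raise TypeError("indexing is not supported for streamed data")
    return data[index]


items = [0, 1, 2, 3, 4]
from_list = get_item(items, 2)  # lists support random access

try:
    get_item(iter(items), 1)  # a list iterator has no __getitem__
    stream_error = None
except TypeError as exc:
    stream_error = str(exc)
```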

__setitem__(index, value)

Index-based setting of an element in a DataCollection.

Assign a value to the element at the given index, similar to list[at_index]=val. Does not work with streamed DataCollections.

Parameters:

index (int) – The index of the element being set.
value (any) – The value to be set.

Raises:

TypeError – If called on a streamed DataCollection.

Examples

Usage with non-streamed:

>>> dc = DataCollection([0, 1, 2, 3, 4])
>>> dc[2] = 3
>>> dc.to_list()
[0, 1, 3, 3, 4]

Usage with streamed:

>>> dc.stream()[1] 
Traceback (most recent call last):
TypeError: indexing is only supported for DataCollection created from list
    or pandas DataFrame.

__add__(other) → DataCollection

Concatenate two DataCollections.

Parameters: other (DataCollection) – The DataCollection being appended to the calling DataCollection.
Returns: A new DataCollection containing the concatenated DataCollections.
Return type: DataCollection

Examples

>>> dc0 = DataCollection.range(5)
>>> dc1 = DataCollection.range(5)
>>> dc2 = DataCollection.range(5)
>>> (dc0 + dc1 + dc2)
[0, 1, 2, 3, 4, 0, ...]
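For iterables in general, concatenation can be done lazily with `itertools.chain`. The sketch below uses plain Python and a hypothetical helper named `concat`; it illustrates the idea and makes no claim about how towhee implements `__add__`.

```python
from itertools import chain


def concat(*iterables):
    """Lazily yield every item of each iterable, in order."""
    return chain.from_iterable(iterables)


combined = list(concat(range(5), range(5), range(5)))
```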

__repr__() → str

String representation of the DataCollection.

Returns: String representation of the DataCollection.
Return type: str

Examples

Usage with non-streamed:

>>> DataCollection([1, 2, 3]).unstream()
[1, 2, 3]

Usage with streamed:

>>> DataCollection([1, 2, 3]).stream() 
<list_iterator object at...>

static range(*arg, **kws) → DataCollection

Generate a DataCollection from a range of values.

Generate a DataCollection with a range of numbers as the data. Works the same way as Python's built-in range() function.

Returns: A new DataCollection.
Return type: DataCollection

Examples

>>> DataCollection.range(5).to_list()
[0, 1, 2, 3, 4]

to_list() → list

Convert DataCollection to list.

Returns: List of values stored in DataCollection.
Return type: list

Examples

>>> DataCollection.range(5).to_list()
[0, 1, 2, 3, 4]

map(*arg) → DataCollection

Apply a function across all values in a DataCollection.

Multiple functions can be applied to the DataCollection. If multiple functions are supplied, the same number of new DataCollections is returned.

Parameters: *arg (Callable) – One or more functions to apply to the DataCollection.
Returns: New DataCollection containing the computation results.
Return type: DataCollection

Examples

Single Function:

>>> dc = DataCollection([1,2,3,4])
>>> dc.map(lambda x: x+1).map(lambda x: x*2).to_list()
[4, 6, 8, 10]

Multiple Functions:

>>> dc = DataCollection([1,2,3,4])
>>> a, b = dc.map(lambda x: x+1, lambda x: x*2)
>>> (a.to_list(), b.to_list())
([2, 3, 4, 5], [2, 4, 6, 8])
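When the input is a one-shot iterator, applying several functions requires giving each function its own view of the data. One common way to do this in plain Python is `itertools.tee`, sketched below; `multi_map` is a hypothetical helper, and towhee may handle this case differently.

```python
from itertools import tee


def multi_map(data, *funcs):
    """Apply each function to its own copy of the input stream."""
    copies = tee(data, len(funcs))
    return tuple(map(fn, copy) for fn, copy in zip(funcs, copies))


a, b = multi_map(iter([1, 2, 3, 4]), lambda x: x + 1, lambda x: x * 2)
a_list, b_list = list(a), list(b)
```

Note that `tee` buffers items until every copy has consumed them, so draining one result completely before the other keeps the whole input in memory.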

filter(unary_op: Callable, drop_empty=False) → DataCollection

Filter the DataCollection data based on a function.

Filters the DataCollection based on the function provided. If the data is stored as an Option (see towhee.functional.option.py), drop_empty determines whether a filtered-out element is removed or set to empty.

Parameters:

unary_op (Callable) – Function that dictates filtering.
drop_empty (bool, optional) – Whether to drop empty fields. Defaults to False.

Returns:

Resulting DataCollection after filter.

Return type:

DataCollection
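The following sketch shows one possible reading of the drop_empty behaviour described above, using plain Python. It is an assumption-laden illustration: `EMPTY` is a hypothetical sentinel standing in for an empty Option, `filter_options` is a hypothetical helper, and the actual semantics in towhee may differ.

```python
EMPTY = object()  # hypothetical stand-in for an "empty" Option value


def filter_options(data, predicate, drop_empty=False):
    """Keep items passing the predicate; drop or blank out the rest."""
    out = []
    for x in data:
        if x is not EMPTY and predicate(x):
            out.append(x)
        elif not drop_empty:
            out.append(EMPTY)  # keep the slot, but mark it as empty
    return out


kept_slots = filter_options([1, 2, 3, 4], lambda x: x > 2)
dropped = filter_options([1, 2, 3, 4], lambda x: x > 2, drop_empty=True)
```

With drop_empty left at False the result keeps its original length, which preserves positional alignment with other data; with drop_empty=True only the passing items remain.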

run()

Iterate through the DataCollection's data.

A stream-based DataCollection does not execute until its data is consumed by a data sink. This function acts as a data sink that consumes the data without performing any further operations.
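The need for a sink comes from lazy evaluation: building a pipeline over an iterator does no work until something pulls items through it. The plain-Python sketch below demonstrates this with `map` and drains the pipeline with a zero-length `deque`; the names `record`, `seen`, and `pipeline` are hypothetical, and this is not towhee's implementation of run().

```python
from collections import deque

seen = []  # side-effect log showing when items are actually processed


def record(x):
    seen.append(x)
    return x


pipeline = map(record, iter([1, 2, 3]))  # lazy: nothing has run yet
before = list(seen)

deque(pipeline, maxlen=0)  # drain the iterator and discard the results
after = list(seen)
```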

to_df() → DataFrame

Convert a DataCollection into a DataFrame.

Returns: The converted DataFrame.
Return type: DataFrame

Examples

>>> from towhee import DataCollection, Entity
>>> e = [Entity(a=a, b=b) for a,b in zip(['abc', 'def', 'ghi'], [1,2,3])]
>>> dc = DataCollection(e)
>>> type(dc)
<class 'towhee.functional.data_collection.DataCollection'>

>>> type(dc.to_df())
<class 'towhee.functional.data_collection.DataFrame'>

__weakref__: list of weak references to the object (if defined)