@@ -120,31 +120,51 @@
         pad_data_in: bool = False,
         **kwargs,
     ):
-        """Evaluates the pipeline on data_in to produce an output dataset ds_out.
+        """Evaluates the pipeline on ``data_in`` to produce an output dataset ``ds_out``.

         Args:
             data_in: Input passed to the transform to generate output dataset. Should support \__getitem__ and \__len__. Can be a Deep Lake dataset.
-            ds_out (Dataset, optional): The dataset object to which the transform will get written. If this is not provided, data_in will be overwritten if it is a Deep Lake dataset, otherwise error will be raised.
-                It should have all keys being generated in output already present as tensors. It's initial state should be either:-
-                - Empty i.e. all tensors have no samples. In this case all samples are added to the dataset.
-                - All tensors are populated and have sampe length. In this case new samples are appended to the dataset.
+            ds_out (Dataset, optional): - The dataset object to which the transform will get written. If this is not provided, ``data_in`` will be overwritten if it is a Deep Lake dataset, otherwise an error will be raised.
+                - It should have all keys being generated in output already present as tensors. Its initial state should be either:
+                    - **Empty**, i.e., all tensors have no samples. In this case all samples are added to the dataset.
+                    - **All tensors are populated and have same length.** In this case new samples are appended to the dataset.
             num_workers (int): The number of workers to use for performing the transform. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
             scheduler (str): The scheduler to be used to compute the transformation. Supported values include: 'serial', 'threaded', 'processed' and 'ray'.
                 Defaults to 'threaded'.
-            progressbar (bool): Displays a progress bar if True (default).
-            skip_ok (bool): If True, skips the check for output tensors generated. This allows the user to skip certain tensors in the function definition.
-                This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.
-            check_lengths (bool): If True, checks whether ds_out has tensors of same lengths initially.
-            pad_data_in (bool): NOTE: This is only applicable if data_in is a Deep Lake dataset. If True, pads tensors of data_in to match the length of the largest tensor in data_in.
-                Defaults to False.
+            progressbar (bool): Displays a progress bar if ``True`` (default).
+            skip_ok (bool): If ``True``, skips the check for output tensors generated. This allows the user to skip certain tensors in the function definition.
+                This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to ``False``.
+            check_lengths (bool): If ``True``, checks whether ``ds_out`` has tensors of same lengths initially.
+            pad_data_in (bool): If ``True``, pads tensors of ``data_in`` to match the length of the largest tensor in ``data_in``.
+                Defaults to ``False``.
             **kwargs: Additional arguments.

         Raises:
-            InvalidInputDataError: If data_in passed to transform is invalid. It should support \__getitem__ and \__len__ operations. Using scheduler other than "threaded" with deeplake dataset having base storage as memory as data_in will also raise this.
-            InvalidOutputDatasetError: If all the tensors of ds_out passed to transform don't have the same length. Using scheduler other than "threaded" with deeplake dataset having base storage as memory as ds_out will also raise this.
+            InvalidInputDataError: If ``data_in`` passed to transform is invalid. It should support \__getitem__ and \__len__ operations. Using scheduler other than "threaded" with deeplake dataset having base storage as memory as ``data_in`` will also raise this.
+            InvalidOutputDatasetError: If all the tensors of ``ds_out`` passed to transform don't have the same length. Using scheduler other than "threaded" with deeplake dataset having base storage as memory as ``ds_out`` will also raise this.
             TensorMismatchError: If one or more of the outputs generated during transform contain different tensors than the ones present in 'ds_out' provided to transform.
             UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: 'serial', 'threaded', 'processed' and 'ray'.
             TransformError: All other exceptions raised if there are problems while running the pipeline.
+
+        Example::
+
+            @deeplake.compute
+            def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
+                samples_out.my_tensor.append(my_arg0 * my_arg1)
+
+            # This transform can be used using the eval method in one of these two ways:
+
+            # Directly evaluating the method
+            # here arg0 and arg1 correspond to the 3rd and 4th argument in my_fn
+            my_fn(arg0, arg1).eval(data_in, ds_out, scheduler="threaded", num_workers=5)
+
+            # As a part of a Transform pipeline containing other functions
+            pipeline = deeplake.compose([my_fn(a, b), another_function(x=2)])
+            pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)
+
+        Note:
+            ``pad_data_in`` is only applicable if ``data_in`` is a Deep Lake dataset.
+
         """
         num_workers, scheduler = sanitize_workers_scheduler(num_workers, scheduler)

         overwrite = ds_out is None
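The semantics documented in this docstring — each ``sample_in`` appends to named output tensors, ``ds_out`` must start with all tensors at equal length, and the generated tensors are checked unless ``skip_ok`` is set — can be sketched as a plain-Python toy serial evaluator. This is only an illustration of the contract, not the deeplake implementation; ``ToyDataset``, ``toy_eval``, and ``double`` are hypothetical names that do not exist in the deeplake API:

```python
class ToyDataset:
    """A toy stand-in for ds_out: a dict of named tensors, each a list of samples."""

    def __init__(self, tensor_names):
        self.tensors = {name: [] for name in tensor_names}


def toy_eval(fn, data_in, ds_out, skip_ok=False):
    """Serially apply fn(sample_in, samples_out) over data_in, appending to ds_out.

    Mirrors the documented contract: ds_out tensors must start at equal length
    (e.g. all empty), and unless skip_ok is True, every output tensor must be
    written for every sample (cf. TensorMismatchError in the real API).
    """
    lengths = {len(t) for t in ds_out.tensors.values()}
    if len(lengths) > 1:
        raise ValueError("ds_out tensors must have the same length initially")
    for i in range(len(data_in)):          # data_in only needs __getitem__/__len__
        sample_in = data_in[i]
        out = {name: [] for name in ds_out.tensors}
        fn(sample_in, out)
        if not skip_ok:
            missing = [name for name, values in out.items() if not values]
            if missing:
                raise ValueError(f"transform did not write tensors: {missing}")
        for name, values in out.items():
            ds_out.tensors[name].extend(values)
    return ds_out


def double(sample_in, samples_out):
    """A toy transform in the shape of my_fn above."""
    samples_out["my_tensor"].append(sample_in * 2)


ds = ToyDataset(["my_tensor"])
toy_eval(double, [1, 2, 3], ds)
print(ds.tensors["my_tensor"])  # [2, 4, 6]
```

With ``num_workers=0`` the real pipeline likewise falls back to serial processing, so this sketch corresponds to that case; the threaded/processed/ray schedulers parallelize the same per-sample contract.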