hyrax.data_sets.data_provider
=============================

.. py:module:: hyrax.data_sets.data_provider


Attributes
----------

.. autoapisummary::

   hyrax.data_sets.data_provider.logger


Classes
-------

.. autoapisummary::

   hyrax.data_sets.data_provider.DataProvider


Functions
---------

.. autoapisummary::

   hyrax.data_sets.data_provider.generate_data_request_from_config


Module Contents
---------------

.. py:data:: logger

.. py:function:: generate_data_request_from_config(config)

   This function handles the backward compatibility issue of defining the requested
   dataset in the `[data_set]` table in the config. If a `[model_inputs]` table
   is not defined, we will assemble a `data_request` dictionary from the values
   defined elsewhere in the configuration file.

   NOTE: We should anticipate deprecating the ability to define a data_request in
   `[data_set]`. When that happens, we should be able to remove this function.

   :param config: The Hyrax configuration that is passed to each dataset instance.
   :type config: dict

   :returns: A dictionary where keys are dataset names and values are lists of fields
   :rtype: dict
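
   As a hedged illustration of the fallback described above, the sketch below
   prefers a ``[model_inputs]`` table and otherwise assembles a request from
   ``[data_set]``. The helper name ``sketch_data_request``, the legacy ``name``
   and ``fields`` keys, and the ``"data"`` friendly name are assumptions made
   for illustration; they are not taken from the Hyrax source.

   .. code-block:: python

       def sketch_data_request(config: dict) -> dict:
           """Return config["model_inputs"] if set, else build one from config["data_set"]."""
           model_inputs = config.get("model_inputs")
           if model_inputs:
               # The newer table wins whenever it is present and non-empty.
               return model_inputs

           # Hypothetical legacy layout: these key names are assumptions.
           legacy = config.get("data_set", {})
           return {
               "data": {
                   "dataset_class": legacy.get("name"),
                   "fields": list(legacy.get("fields", [])),
               }
           }


       legacy_config = {"data_set": {"name": "MyDataset", "fields": ["image"]}}
       request = sketch_data_request(legacy_config)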


.. py:class:: DataProvider(config: dict, request: dict)

   This class presents itself as a PyTorch Dataset, but acts like a GraphQL
   gateway that fetches data from multiple datasets based on the `model_inputs`
   dictionary provided during initialization.

   This class allows for flexible data retrieval from multiple dataset classes,
   each of which can have different fields requested.

   Additionally, the user can provide specific configuration options for each
   dataset class that will be merged with the original configuration provided
   during initialization.

   Initialize the DataProvider with a Hyrax config and extract (or create)
   the data_request.

   :param config: The Hyrax configuration that defines the data_request.
   :type config: dict
   :param request: A dictionary that defines the data request.
   :type request: dict


   .. py:attribute:: config


   .. py:attribute:: data_request


   .. py:attribute:: prepped_datasets


   .. py:attribute:: dataset_getters


   .. py:attribute:: all_metadata_fields


   .. py:attribute:: requested_fields


   .. py:attribute:: custom_collate_functions


   .. py:attribute:: primary_dataset
      :value: None


   .. py:attribute:: primary_dataset_id_field_name
      :value: None


   .. py:method:: pull_up_primary_dataset_methods()

      If a primary dataset is defined, we will pull up some of its methods
      to the DataProvider level so that they can be called directly on the
      DataProvider instance.
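
      One way such a "pull up" could work is sketched below with toy classes.
      The class names, the ``PULLED_UP`` tuple, and the method names in it are
      hypothetical; which methods the real ``DataProvider`` exposes is not
      stated here.

      .. code-block:: python

          class ToyPrimary:
              def describe(self) -> str:
                  return "primary dataset"


          class ToyProvider:
              # Hypothetical list of method names to expose; the real list may differ.
              PULLED_UP = ("describe", "not_implemented_here")

              def __init__(self, primary=None):
                  self.primary_dataset = primary
                  self.pull_up()

              def pull_up(self) -> None:
                  if self.primary_dataset is None:
                      return
                  for name in self.PULLED_UP:
                      method = getattr(self.primary_dataset, name, None)
                      if callable(method):
                          # Bound methods keep their reference to the primary dataset.
                          setattr(self, name, method)


          provider = ToyProvider(ToyPrimary())
          description = provider.describe()

      Methods the primary dataset does not implement are skipped, so the
      provider only gains attributes that are safe to call.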


   .. py:method:: __getitem__(idx) -> dict

      This method returns data for a given index.

      It is also a wrapper that allows this class to be treated as a PyTorch
      Dataset.

      :param idx: The index of the data item to retrieve.
      :type idx: int

      :returns: A dictionary containing the requested data from the prepared datasets.
      :rtype: dict


   .. py:method:: __len__() -> int

      Returns the length of the dataset.
      If the primary dataset is defined, it will return that length, otherwise
      it will use the length of the first dataset in ``self.prepped_datasets``.


   .. py:method:: __repr__() -> str


   .. py:method:: fields() -> dict

      Print all the available fields for each dataset in the DataProvider.

      :returns: A dictionary mapping friendly dataset names to their available fields.
      :rtype: dict


   .. py:method:: is_iterable()

      DataProvider datasets will always be map-style datasets.


   .. py:method:: is_map()

      DataProvider datasets will always be map-style datasets.


   .. py:method:: prepare_datasets()

      Instantiate each of the requested datasets based on the ``model_inputs``
      configuration dictionary. Store the prepared instances in the
      ``self.prepped_datasets`` dictionary.
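
      The instantiation loop might look roughly like the sketch below. The
      ``REGISTRY`` lookup, the ``ToyDataset`` class, and its constructor
      signature are assumptions for illustration; only the ``dataset_class``
      and ``data_location`` keys come from the dataset definition format
      documented for this class.

      .. code-block:: python

          class ToyDataset:
              """Stand-in dataset class; the constructor signature is an assumption."""

              def __init__(self, config: dict, data_location=None):
                  self.config = config
                  self.data_location = data_location


          # Hypothetical registry mapping class names to classes.
          REGISTRY = {"ToyDataset": ToyDataset}


          def sketch_prepare_datasets(config: dict, data_request: dict) -> dict:
              prepped = {}
              for friendly_name, definition in data_request.items():
                  dataset_cls = REGISTRY[definition["dataset_class"]]
                  prepped[friendly_name] = dataset_cls(
                      config=config,
                      data_location=definition.get("data_location"),
                  )
              return prepped


          prepped_datasets = sketch_prepare_datasets(
              {"seed": 1},
              {
                  "my_dataset": {
                      "dataset_class": "ToyDataset",
                      "data_location": "/path/to/data",
                  }
              },
          )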


   .. py:method:: _apply_configurations(base_config: dict, dataset_definition: dict) -> dict
      :staticmethod:


      Merge the original base config with the dataset-specific config.

      This function uses ``ConfigManager.merge_configs`` to merge the
      dataset-specific configuration into a copy of the original base config.

      If no ``dataset_config`` is provided in the ``dataset_definition`` dict,
      the original base config will be returned unmodified.

      Example of a dataset definition dictionary:

      .. code-block:: python

          "my_dataset": {
              "dataset_class": "MyDataset",
              "data_location": "/path/to/data",
              "dataset_config": {
                  "param1": "value1",
                  "param2": "value2"
              },
              "fields": ["field1", "field2"]
          }

      or equivalently in a .toml file:

      .. code-block:: toml

          [model_inputs]
          [model_inputs.my_dataset]
          dataset_class = "MyDataset"
          data_location = "/path/to/data"
          fields = ["field1", "field2"]
          [model_inputs.my_dataset.dataset_config]
          param1 = "value1"
          param2 = "value2"

      In this example, the ``dataset_config`` dictionary will be merged into
      the original base config, overriding the values of param1 and param2
      when creating an instance of ``MyDataset``.

      :param base_config: The original base configuration dictionary. A copy of this is created,
                          the ``dataset_config`` from ``dataset_definition`` is merged into the
                          copy, and the copy is returned.
      :type base_config: dict
      :param dataset_definition: A dictionary defining the dataset, including any dataset-specific
                                 configuration options in a nested ``dataset_config`` dictionary.
      :type dataset_definition: dict

      :returns: A final configuration dictionary to be passed when creating an instance
                of the dataset class.
      :rtype: dict
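
      The merge behavior described above can be approximated without Hyrax as
      follows. The recursive ``sketch_merge`` helper is a stand-in for
      ``ConfigManager.merge_configs``, whose exact semantics are assumed here
      to be a recursive override; both function names are hypothetical.

      .. code-block:: python

          import copy


          def sketch_merge(base: dict, overrides: dict) -> dict:
              """Recursively merge overrides into base in place and return base."""
              for key, value in overrides.items():
                  if isinstance(value, dict) and isinstance(base.get(key), dict):
                      sketch_merge(base[key], value)
                  else:
                      base[key] = value
              return base


          def sketch_apply_configurations(base_config: dict, dataset_definition: dict) -> dict:
              dataset_config = dataset_definition.get("dataset_config")
              if not dataset_config:
                  # Nothing dataset-specific to apply: return the base config as-is.
                  return base_config
              # Merge into a deep copy so the caller's base config is untouched.
              return sketch_merge(copy.deepcopy(base_config), dataset_config)


          base = {"param1": "default", "other": {"keep": True}}
          definition = {
              "dataset_class": "MyDataset",
              "dataset_config": {"param1": "value1", "param2": "value2"},
          }
          final = sketch_apply_configurations(base, definition)

      Merging into a deep copy means each dataset can receive its own
      overrides without the overrides leaking into other datasets that share
      the same base config.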


   .. py:method:: sample_data() -> dict

      Returns a data sample. Primarily this will be used for instantiating a
      model so that any runtime resizing can be handled properly.

      :returns: A dictionary containing the data for index 0.
      :rtype: dict


   .. py:method:: ids()

      Returns the IDs of the dataset.

      If the primary dataset is defined, its IDs are returned. Otherwise, the
      IDs of the first dataset in ``self.prepped_datasets`` are returned.


   .. py:method:: resolve_data(idx: int) -> dict

      This method requests the field data from the prepared datasets by index.

      :param idx: The index of the data item to retrieve.
      :type idx: int

      :returns: A dictionary containing the requested data from the prepared datasets.
      :rtype: dict
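
      A toy version of this lookup is sketched below. It assumes each dataset
      exposes one ``get_<field>`` method per field and that the result is
      nested by friendly name; both are assumptions for illustration, as are
      the toy class and variable names.

      .. code-block:: python

          class ToyImages:
              def get_image(self, idx: int) -> list:
                  return [idx, idx + 1]

              def get_label(self, idx: int) -> int:
                  return idx % 2


          class ToyCatalog:
              def get_ra(self, idx: int) -> float:
                  return 10.0 + idx


          datasets = {"images": ToyImages(), "catalog": ToyCatalog()}
          requested_fields = {"images": ["image", "label"], "catalog": ["ra"]}

          # Look up one getter per requested field ahead of time.
          getters = {
              name: {field: getattr(datasets[name], f"get_{field}") for field in fields}
              for name, fields in requested_fields.items()
          }


          def sketch_resolve_data(idx: int) -> dict:
              return {
                  name: {field: getter(idx) for field, getter in field_getters.items()}
                  for name, field_getters in getters.items()
              }


          sample = sketch_resolve_data(3)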


   .. py:method:: metadata(idxs=None, fields=None) -> numpy.ndarray

      Fetch the requested metadata fields for the given indices.

      Example:

      .. code-block:: python

          # Fetch the metadata_1 and metadata_2 fields from the dataset with the
          # friendly name "random_1".

          metadata = data_provider.metadata(
              idxs=[0, 1, 2],
              fields=["metadata_1_random_1", "metadata_2_random_1"]
          )

      :param idxs: A list of indices for which to fetch metadata. If None, no metadata
                   will be returned.
      :type idxs: list of int, optional
      :param fields: A list of metadata fields to fetch. If None, no metadata will be
                     returned.
      :type fields: list of str, optional

      :returns: A structured NumPy array containing the requested metadata fields.
                The dtype names of the array will be the metadata field names, modified
                to include the friendly name of the dataset they come from. For example,
                if the "RA" field comes from a dataset with the friendly name "cifar",
                the returned field name will be "RA_cifar".
      :rtype: np.ndarray
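
      To make the return format concrete, the sketch below builds a structured
      array whose dtype names carry the friendly-name suffix. The ``tables``
      dictionary and ``sketch_metadata`` function are hypothetical stand-ins;
      only the ``<field>_<friendly_name>`` naming follows the description
      above.

      .. code-block:: python

          import numpy as np

          # Hypothetical per-dataset metadata columns, keyed by friendly name.
          tables = {
              "cifar": {"RA": [10.0, 11.0, 12.0, 13.0]},
              "random_1": {"metadata_1": [1, 2, 3, 4]},
          }


          def sketch_metadata(idxs: list, fields: list) -> np.ndarray:
              columns = {}
              for friendly_name, table in tables.items():
                  for field, values in table.items():
                      suffixed = f"{field}_{friendly_name}"
                      if suffixed in fields:
                          # Select only the requested rows for this column.
                          columns[suffixed] = np.asarray(values)[idxs]
              dtype = [(name, columns[name].dtype) for name in fields]
              out = np.empty(len(idxs), dtype=dtype)
              for name in fields:
                  out[name] = columns[name]
              return out


          meta = sketch_metadata([0, 2], ["RA_cifar", "metadata_1_random_1"])
          ra_values = meta["RA_cifar"]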


   .. py:method:: metadata_fields(friendly_name=None) -> list[str]

      Returns a list of metadata fields that are available across all prepared
      datasets.

      The field names will be modified to include the friendly name of the
      dataset they come from. For example, if the "RA" field comes from a dataset
      with the friendly name "cifar", the returned field name will be "RA_cifar".

      NOTE: If a specific dataset friendly_name is provided, only the metadata
      fields for that dataset will be returned, and the field names will not
      include the friendly name suffix.

      :param friendly_name: If provided, only the metadata fields for the specified friendly name
                            will be returned. If not provided, metadata fields from all datasets
                            will be returned.
      :type friendly_name: str, optional

      :returns: The column names of the metadata table passed. Empty list if no metadata
                was provided during construction of the DataProvider.
      :rtype: list[str]
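
      The naming rule can be illustrated with plain Python. The ``available``
      mapping and ``sketch_metadata_fields`` function are hypothetical; the
      sketch only mirrors the suffixing behavior described above.

      .. code-block:: python

          # Hypothetical metadata columns per dataset friendly name.
          available = {"cifar": ["RA", "dec"], "random_1": ["metadata_1"]}


          def sketch_metadata_fields(friendly_name=None) -> list:
              if friendly_name is not None:
                  # Single dataset: names are returned without the suffix.
                  return list(available[friendly_name])
              # All datasets: append the friendly name so names stay unique.
              return [
                  f"{field}_{name}"
                  for name, fields in available.items()
                  for field in fields
              ]


          all_fields = sketch_metadata_fields()
          cifar_fields = sketch_metadata_fields("cifar")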


   .. py:method:: _primary_or_first_dataset()

      Returns the primary dataset instance if it exists, otherwise returns
      the first dataset in ``self.prepped_datasets``.
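
      A minimal sketch of this fallback, assuming the primary dataset is
      identified by a friendly name that is a key of the prepared-datasets
      dictionary (an assumption; the function name is also hypothetical):

      .. code-block:: python

          def sketch_primary_or_first(prepped_datasets: dict, primary_dataset=None):
              if primary_dataset is not None:
                  # Assumes primary_dataset names a key of prepped_datasets.
                  return prepped_datasets[primary_dataset]
              # Dicts preserve insertion order, so this is the first prepared dataset.
              return next(iter(prepped_datasets.values()))


          prepped = {"images": [1, 2, 3], "catalog": [10, 20]}
          first = sketch_primary_or_first(prepped)
          primary = sketch_primary_or_first(prepped, "catalog")

      The same fallback explains why ``__len__`` and ``ids`` report values
      from the first prepared dataset when no primary dataset is set.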


   .. py:method:: collate(batch: list[dict]) -> dict

      Custom collate function to be used outside the context of a PyTorch
      DataLoader.

      This function takes a list of data samples (each sample is a dictionary)
      and combines them into a single batch dictionary.

      :param batch: A list of data samples, where each sample is a dictionary.
      :type batch: list of dict

      :returns: A dictionary where each key corresponds to a field and the value is
                a list of values for that field across the batch.
      :rtype: dict
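
      The list-of-dicts to dict-of-lists transformation can be sketched as
      follows. This toy version assumes flat samples and ignores any
      per-dataset functions stored in ``custom_collate_functions``; the
      function name is hypothetical.

      .. code-block:: python

          def sketch_collate(batch: list) -> dict:
              collated = {}
              for sample in batch:
                  for field, value in sample.items():
                      # Create the list on first sight of a field, then append.
                      collated.setdefault(field, []).append(value)
              return collated


          batch = [
              {"image": [0, 1], "label": 0},
              {"image": [2, 3], "label": 1},
          ]
          collated = sketch_collate(batch)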