hyrax.data_sets.hyrax_csv_dataset

Classes

HyraxCSVDataset

A Hyrax Dataset for CSV files.

Module Contents

class HyraxCSVDataset(config: dict, data_location: pathlib.Path = None)[source]

Bases: hyrax.data_sets.data_set_registry.HyraxDataset

A Hyrax Dataset for CSV files. This class reads a CSV file using pandas with memory mapping enabled. It dynamically creates getter methods for each column in the CSV file, allowing users to request data from specific columns.

Note: Column names found in the CSV file are used to create the getter methods. If a column name contains characters that are invalid for method names, those characters are replaced with underscores.
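The getter creation and name sanitization described above can be sketched as follows. This is a minimal illustration, not the Hyrax implementation: it uses the standard-library csv module in place of pandas, and the ColumnGetters class, the sanitize helper, the get_ prefix, and the exact replacement rule (every non-word character becomes an underscore) are all assumptions made for the example.

```python
import csv
import io
import re


def sanitize(column_name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore with '_'."""
    return re.sub(r"\W", "_", column_name)


class ColumnGetters:
    """Hypothetical stand-in that builds one getter per CSV column."""

    def __init__(self, csv_text: str):
        reader = csv.DictReader(io.StringIO(csv_text))
        self.rows = list(reader)
        self.column_names = list(reader.fieldnames or [])
        for name in self.column_names:
            # Bind `name` as a default argument so each getter keeps its own column.
            def getter(idx, _name=name):
                return self.rows[idx][_name]

            # The "get_" prefix is an assumption made for this sketch.
            setattr(self, f"get_{sanitize(name)}", getter)


demo = ColumnGetters("object id,flux (Jy)\nA1,0.5\nB2,1.25\n")
print(demo.get_object_id(0))   # value of "object id" in the first record
print(demo.get_flux__Jy_(1))   # value of "flux (Jy)" in the second record
```

Note that two different column names can sanitize to the same method name (for example "a-b" and "a b"), so a real implementation would need some rule for handling collisions.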

Example model_inputs configuration:

{
    "train": {
        "data": {
            "dataset_class": "HyraxCSVDataset",
            "data_location": </path/to/data.csv>,
            "fields": ["<column1>", "<column2>", …],
            "primary_id_field": <column name that contains a unique ID>,
        },
    },
    "validate": { <similar to above> },
    "infer": { <similar to above> },
}
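As a concrete illustration, the same structure can be written as a Python dictionary. The path and column names below are hypothetical placeholders, and the small check only confirms that each entry carries the keys shown in the example; whether Hyrax requires all of them is not stated here.

```python
# Hypothetical values: the file path and column names are placeholders.
model_inputs = {
    "train": {
        "data": {
            "dataset_class": "HyraxCSVDataset",
            "data_location": "/path/to/data.csv",
            "fields": ["column1", "column2"],
            "primary_id_field": "object_id",
        },
    },
}

# Keys that appear in the documented example configuration.
example_keys = {"dataset_class", "data_location", "fields", "primary_id_field"}

# For each split, collect any example key that is absent from its "data" entry.
missing = {
    split: sorted(example_keys - set(entry["data"]))
    for split, entry in model_inputs.items()
}
print(missing)
```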

__init__()[source]

Overall initialization for all datasets, which saves the config.

Subclasses of HyraxDataset should call this at the end of their __init__, like so:

from hyrax.data_sets import HyraxDataset
from torch.utils.data import Dataset

class MyDataset(HyraxDataset, Dataset):
    def __init__(self, config):
        <your code>
        super().__init__(config)

If per-tensor metadata is available, dataset authors are encouraged to create an astropy Table of that metadata, in the same order as their data, and pass it as metadata_table as shown below:

from hyrax.data_sets import HyraxDataset
from torch.utils.data import Dataset
from astropy.table import Table

class MyDataset(HyraxDataset, Dataset):
    def __init__(self, config):
        <your code>
        metadata_table = Table(<Your catalog data goes here>)
        super().__init__(config, metadata_table)

Parameters:

config (dict, Optional) – The runtime configuration for hyrax
metadata_table (Optional[Table], optional) – An Astropy Table that (1) contains the metadata columns desired for visualization and (2) is in the same order in which your data will be enumerated.
object_id_column_name (Optional[str], optional) – The name of the column containing object IDs. If None, uses the default from config or creates one from the ids() method.

data_location = None[source]

column_names[source]

mem_mapped_csv = None[source]

__getitem__(idx)[source]: Currently required by Hyrax machinery, but likely to be phased out.

__len__() → int[source]: Return the number of records in the CSV.

sample_data()[source]: Return the first record, in dictionary form, as the sample.

classmethod is_map() → bool[source]: Boilerplate method to indicate this is a map-style dataset.
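
The methods listed above follow the map-style dataset pattern, in which records are fetched by integer index through __getitem__ and counted through __len__. A minimal sketch of that interface, using a hypothetical TinyMapDataset class over an in-memory list of dictionaries rather than the actual HyraxCSVDataset:

```python
class TinyMapDataset:
    """Hypothetical map-style dataset over an in-memory list of records."""

    def __init__(self, records):
        self.records = list(records)

    def __len__(self) -> int:
        # Number of records available.
        return len(self.records)

    def __getitem__(self, idx):
        # Map-style access: fetch one record by integer index.
        return self.records[idx]

    def sample_data(self):
        # Return the first record, in dictionary form, as the sample.
        return self.records[0]

    @classmethod
    def is_map(cls) -> bool:
        # Signals that the dataset supports index-based access.
        return True


dataset = TinyMapDataset([{"id": "A1", "flux": 0.5}, {"id": "B2", "flux": 1.25}])
print(len(dataset), dataset[1]["id"], dataset.sample_data(), TinyMapDataset.is_map())
```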