hyrax.data_sets.hyrax_csv_dataset
Classes
HyraxCSVDataset: A Hyrax Dataset for CSV files.
Module Contents
- class HyraxCSVDataset(config: dict, data_location: pathlib.Path = None)
Bases: hyrax.data_sets.data_set_registry.HyraxDataset
A Hyrax Dataset for CSV files. This class reads a CSV file using pandas with memory mapping enabled. It dynamically creates a getter method for each column in the CSV file, so that users can request data from specific columns.
Note: Column names found in the CSV file are used to create the getter methods. If a column name contains characters that are invalid for method names, those characters are replaced with underscores.
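The naming rule in the note above can be illustrated with a minimal sketch. This is not the Hyrax implementation: the `sanitize_column_name` helper, the `ColumnGetters` class, and the `get_` prefix are all hypothetical names chosen for the illustration, and a plain dict of lists stands in for the pandas data.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Replace characters that are not letters, digits, or "_" with "_".

    Hypothetical helper; the exact rule Hyrax applies may differ.
    """
    return re.sub(r"\W", "_", name)


class ColumnGetters:
    """Toy stand-in that attaches one getter per column at construction time."""

    def __init__(self, columns: dict):
        self._columns = columns
        for name in columns:
            # The "get_" prefix is an assumption made for this sketch.
            setattr(self, f"get_{sanitize_column_name(name)}", self._make_getter(name))

    def _make_getter(self, name: str):
        # Bind the original column name so each getter reads its own column.
        def getter(idx: int):
            return self._columns[name][idx]

        return getter


table = ColumnGetters({"object id": [101, 102], "flux-g": [1.5, 2.5]})
print(table.get_object_id(0))
print(table.get_flux_g(1))
```

Under this rule, two column names that differ only in an invalid character (for example "flux-g" and "flux g") would map to the same method name, so distinct, identifier-friendly column names are the safer choice.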
Example model_inputs configuration:

    {
        "train": {
            "data": {
                "dataset_class": "HyraxCSVDataset",
                "data_location": </path/to/data.csv>,
                "fields": ["<column1>", "<column2>", …],
                "primary_id_field": <column name that contains a unique ID>,
            },
        },
        "validate": { <similar to above> },
        "infer": { <similar to above> },
    }
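For readers who prefer a filled-in version, the same structure can be written as a Python dict. This is only a sketch: the file paths and the column names ("object_id", "ra", "dec") are hypothetical placeholders, and the `csv_split` helper is not part of Hyrax; it exists only to avoid repeating the inner block for each split.

```python
def csv_split(path: str, fields: list, id_field: str) -> dict:
    """Build one split entry in the shape shown in the example configuration.

    Hypothetical helper for this illustration; not a Hyrax function.
    """
    return {
        "data": {
            "dataset_class": "HyraxCSVDataset",
            "data_location": path,
            "fields": fields,
            "primary_id_field": id_field,
        },
    }


# Hypothetical columns and paths; substitute the ones in your own CSV files.
columns = ["object_id", "ra", "dec"]

model_inputs = {
    "train": csv_split("/path/to/train.csv", columns, "object_id"),
    "validate": csv_split("/path/to/validate.csv", columns, "object_id"),
    "infer": csv_split("/path/to/infer.csv", columns, "object_id"),
}
```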
Overall initialization for all DataSets, which saves the config.
Subclasses of HyraxDataset should call this at the end of their __init__, like:

    from hyrax.data_sets import HyraxDataset
    from torch.utils.data import Dataset

    class MyDataset(HyraxDataset, Dataset):
        def __init__(self, config):
            <your code>
            super().__init__(config)
If per-tensor metadata is available, dataset authors are encouraged to create an astropy Table of that metadata, in the same order as their data, and pass it as metadata_table:

    from hyrax.data_sets import HyraxDataset
    from torch.utils.data import Dataset
    from astropy.table import Table

    class MyDataset(HyraxDataset, Dataset):
        def __init__(self, config):
            <your code>
            metadata_table = Table(<Your catalog data goes here>)
            super().__init__(config, metadata_table)
- Parameters:
config (dict, Optional) – The runtime configuration for hyrax
metadata_table (Optional[Table], optional) – An Astropy Table that (1) contains the metadata columns desired for visualization and (2) has its rows in the same order in which your data will be enumerated.
object_id_column_name (Optional[str], optional) – The name of the column containing object IDs. If None, uses the default from config or creates one from the ids() method.