Configuration
Hyrax ships with a complete default configuration file that can be used immediately to run the software. To make the most of Hyrax, however, you’ll need to modify the configuration to suit your specific needs.
Using the configuration system
When creating an instance of a Hyrax object in a notebook or running hyrax
from the command line, the configuration is the primary method for specifying the parameters.
If no configuration file is specified, the default
will be used. To specify a different configuration file, use the
-c | --runtime-config flag from the CLI
or pass the path to the configuration file when creating a Hyrax object.
from hyrax import Hyrax
# Create an instance of the Hyrax object
f = Hyrax(config_file="<path_to_config_file.toml>")
# Train the model specified in the configuration file
f.train()
>> hyrax train -c <path_to_config_file.toml>
Your first custom configuration
You could create a copy of the entire default configuration file and modify it to suit your needs. However, that’s typically not required, because only a few parameters usually need to be updated for any given Hyrax action.
If a specific configuration file is provided, Hyrax will combine it with the default configuration and overwrite the default values with the specific ones.
For example, if a file called my_config.toml had the following contents:
[general]
log_level = "debug"
It could be used to override the default log_level configuration, while leaving
the rest of the configuration unchanged.
from hyrax import Hyrax
# Create an instance of the Hyrax object
f = Hyrax(config_file="my_config.toml")
# Train the model specified in the configuration file
f.train()
>> hyrax train -c my_config.toml
Updating settings in a notebook
Additionally, Hyrax supports modification of the configuration interactively in a notebook.
from hyrax import Hyrax
# Create a Hyrax instance, implicitly using the default configuration
f = Hyrax()
# Set the log level for the Hyrax instance config
f.config['general']['log_level'] = 'debug'
# Train the model specified in the configuration file
f.train()
Immutable configurations
Once Hyrax begins running an action, the configuration becomes immutable. This means that the configuration cannot be changed during the execution of an action, and attempting to do so in code will raise an exception.
By making the configuration immutable during execution, we ensure that the state of all parameters can be accurately saved with the results of the action.
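To illustrate what an immutable configuration means in practice, here is a small sketch that freezes a nested dictionary with the standard library's MappingProxyType. This is an assumption-laden illustration of the idea only: the freeze helper is hypothetical, and Hyrax's own mechanism and the exception type it raises may differ.

```python
from types import MappingProxyType


def freeze(config: dict) -> MappingProxyType:
    """Return a read-only view of a nested dict (illustrative only)."""
    return MappingProxyType(
        {key: freeze(value) if isinstance(value, dict) else value
         for key, value in config.items()}
    )


config = {"general": {"log_level": "info"}}
frozen = freeze(config)

try:
    # Any write to the frozen view is rejected.
    frozen["general"]["log_level"] = "debug"
    mutation_blocked = False
except TypeError:
    mutation_blocked = True

print(mutation_blocked)                # True
print(frozen["general"]["log_level"])  # info
```

Because the values cannot change after freezing, a copy of the configuration saved alongside the results is guaranteed to describe the run that produced them.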
About the default configuration
The default configuration for Hyrax contains safe default values for all of the settings that Hyrax uses. A portion of the default configuration file is shown below.
Note
Only the first portion of the default configuration file is shown below. The entire file can be found at the bottom of this page: Complete default configuration file.
[general]
# Set to `true` during development to skip checking for default config values
# in external libraries. Use `false` otherwise.
dev_mode = false

# Destination of log messages. Options: 'stderr', 'stdout' specify the console,
# "path/to/hyrax.log" specifies a file.
log_destination = "stderr"

# Lowest log level to emit. Options: "critical", "error", "warning", "info", "debug".
log_level = "info"

# Directory where data is stored.
data_dir = "./data"

#Top level directory for writing results.
results_dir = "./results"


[download]
# Cut out width in arcseconds.
sw = "22asec"

# Cut out height in arcseconds.
sh = "22asec"
There is a lot of information there, but don’t worry, we’ll break it down for you.
First, the file is formatted using TOML, chosen for its readability and because it is one of the few configuration file formats that natively supports comments. TOML files are organized into “tables”, and each table contains one or more key/value pairs.
For instance, the [general] table (the first table in the config)
contains several keys, including log_level and results_dir.
Each of those keys has a value associated with it,
e.g. log_level = "info".
Second, every key has an associated comment describing what the key does. We attempt to keep the comments as concise as possible.
Finally, the configuration file is organized into tables that roughly correspond
to the different actions that Hyrax can take.
For instance, the [train] table contains parameters needed when training a
model, such as epochs and weights_filename,
while the [infer] table contains keys such as model_weights_file.
Complete default configuration file
[general]
# Set to `true` during development to skip checking for default config values
# in external libraries. Use `false` otherwise.
dev_mode = false

# Destination of log messages. Options: 'stderr', 'stdout' specify the console,
# "path/to/hyrax.log" specifies a file.
log_destination = "stderr"

# Lowest log level to emit. Options: "critical", "error", "warning", "info", "debug".
log_level = "info"

# Directory where data is stored.
data_dir = "./data"

#Top level directory for writing results.
results_dir = "./results"


[download]
# Cut out width in arcseconds.
sw = "22asec"

# Cut out height in arcseconds.
sh = "22asec"

# The filters to download.
filter = ["HSC-G", "HSC-R", "HSC-I", "HSC-Z", "HSC-Y"]

# The type of data to download.
type = "coadd"

# The data release to download from.
rerun = "pdr3_wide"

# Path to credentials.ini file for the downloader. File contents should be:
# username = "<your username>"
# password = "<your password>"
credentials_file = "./credentials.ini"

# Alternate way to pass credentials to the downloader. Users should prefer a
# credentials.ini file to avoid exposing credentials with source control.
username = false
password = false

# The number of sources to download from the catalog. Default is -1, which
# downloads all sources in the catalog.
num_sources = -1

# The number of concurrent connections to use when downloading data.
concurrent_connections = 4

# The number of seconds between printing download statistics.
stats_print_interval = 60

# The path to the catalog file that defines which cutouts to download.
fits_file = "./catalog.fits"

# The number of seconds to wait before retrying a failed HTTP request in seconds.
retry_wait = 30

# How many times to retry a failed HTTP request before moving on to the next one.
retries = 3

# Number of seconds to wait for a full HTTP response from the server.
timeout = 3600

# The number of sky location rectangles should we request in a single request.
chunk_size = 990

# Request the image layer from the cutout service
image = true

# Request the variance layer from the cutout service
variance = false

# Request the mask layer from the cutout service
mask = false


[model]
# NOTE: All parameters are NOT used by all models. Check the model code before training.

# The name of the model to use. Option are a built-in model class name or import path
# to an external model. e.g. "HyraxAutoencoder", "user_pkg.model.ExternalModel"
name = ""


[model.HyraxAutoencoder]
# The number of output channels from the first layer.
base_channel_size = 32

# The length of the latent space vector.
latent_dim = 64


[model.HyraxAutoencoderV2]
# The number of output channels from the first layer.
base_channel_size = 32

# The length of the latent space vector.
latent_dim = 64

# The activation function of the final layer.
final_layer = "tanh"

[model.SimCLR]
# The dimension of the projection head for SimCLR
projection_dimension = 128

# The scalar temperature parameter for its loss function, NTXentLoss, for SimCLR
temperature = 0.5

# The probability of applying horizontal flip augmentation for SimCLR
horizontal_flip_probability = 0.5

# The parameters for color jitter augmentation for SimCLR
# [brightness, contrast, saturation, hue]
color_jitter_params = [0.8, 0.8, 0.8, 0.2]

# The probability of applying color jitter augmentation for SimCLR
color_jitter_probability = 0.8

# The probability of applying grayscale augmentation for SimCLR
grayscale_probability = 0.2

# The kernel size of Gaussian blur augmentation for SimCLR
gaussian_blur_kernel_size = 9

# The sigma range used in Gaussian blur augmentation for SimCLR
gaussian_blur_sigma_range = [0.1, 2.0]


[model.HyraxCNN]
# The number of classes to predict as the output of the model. i.e. 2 would be a
# binary classifer, 10 would predict the 10 classes in the CiFAR dataset.
output_classes = 10


[criterion]
# The name of the built-in criterion to use or the import path to an external criterion
name = "torch.nn.CrossEntropyLoss"

# Whether to "sum" or "mean" loss across channels. Only used by HyraxAutoencoderV2
band_loss_reduction = "mean"


[optimizer]
# The name of the built-in optimizer to use or the import path to an external optimizer
name = "torch.optim.SGD"


["torch.optim.SGD"]
# learning rate for torch.optim.SGD optimizer.
lr = 0.01

# momentum for torch.optim.SGD optimizer.
momentum = 0.9

["torch.optim.Adam"]
# learning rate for torch.optim.SGD optimizer.
lr = 0.01


[train]
# The name of the file were the model weights will be saved after training.
weights_filename = "example_model.pth"

#The number of epochs to train for.
epochs = 10

# If resuming from a check point, set to the path of the checkpoint file.
# Otherwise set to `false` to start training from the beginning.
resume = false

# The data_set split to use when training a model.
split = "train"

# The name of the experiment when logging training results to mlflow
experiment_name = "notebook"

# The name of the run when logging training results to mlflow.
# If false, uses result directory string, <timestamp>-train-<uid>, as run name.
run_name = false


[onnx]

# The operator set version to use when exporting a model. See the following for info:
# https://onnxruntime.ai/docs/reference/compatibility.html#onnx-opset-support
opset_version = 20

# The directory to find input model files to convert to ONNX. ONNX-ified models
# will be written to this directory as well.
input_model_directory = false


[model_inputs]
# Top-level table that defines the dataset(s) used for training, validation, and inference.


[data_set]
# Warning - using data_set.name to define the dataset class will be deprecated.
# Please use [model_inputs.<friendly_name>.dataset_class] instead. See usage above.
name = "HyraxCifarDataSet"

# Crop pixel dimensions for images, e.g., [100, 100]. If false, scans for the
# smallest image size in [general].data_dir and uses it.
crop_to = false

# Used by HSCDataSet, LSSTDataset, and DownloadedLSSTDataset.
# Limit to only particular filters. When `false`, use all filters.
# Options: ["HSC-G", "HSC-R", "HSC-I", "HSC-Z", "HSC-Y"] for HSC
# Options: ["u", "g", "r", "i", "z" , "y"] for LSST
filters = false

# Path to a fits file that specifies object IDs to use from the data stored in
# [general].data_dir. Implementation is data_set class dependent. Use `false` for no filtering.
filter_catalog = false

# The transformation to be applied to images before being passed on to the model
# This must be a valid Numpy function. Passing false will result in no transformations
# (other than cropping) be applied to the images.
transform = "tanh"

# train_size, validation_size, and test_size use these conventions:
# * A `float` between `0.0` and `1.0` is the proportion of the dataset to include in the split.
# * An `int`, represents the absolute number of samples in the particular split.
# * It is an error for these values to add to more than 1.0 as ratios or the size
# of the dataset if expressed as integers.

# Size of the train split. If `false`, the value is automatically set to the
# complement of test_size plus validate_size (if any).
train_size = 0.6

# Size of the validation split. If `false`, and both train_size and test_size
# are defined, the value is set to the complement of the other two sizes summed.
# If `false`, and only one of the other sizes is defined, no validate split is created.
validate_size = 0.2

# Size of the test split. If `false`, the value is set to the complement of train_size plus
# the validate_size (if any). If `false` and `train_size = false`, test_size is set to `0.25`.
test_size = 0.2

# Number to seed with for generating a random split. Use `false` to seed from a
# system source at runtime.
seed = false

# If `true`, cache samples in memory during training to reduce runtime after the
# first epoch. Set to `false` when running inference or on memory-constrained systems.
use_cache = true

# If `true`, preload the in memory cache using many worker threads when the dataset is constructed
# to reduce the effect of filesystem latency on first epoch runtime.
# Warning: Only suitable for situations where the entire dataset fits in system memory
preload_cache = true

# Override the name of the object_id column for FitsImageDataset, HSCDataset and DownloadedLSSTDataset
object_id_column_name = false

# Override the name of the filter column for FitsImageDataset and HSCDataset
filter_column_name = false

# Override the name of the filename column for FitsImageDataset and HSCDataset
filename_column_name = false

# Replace NaN in input data with a value, modes are false for no replacement or "quantile" to replace with a
# defined quantile of the non-NaN data, see nan_quantile.
nan_mode = false

# When replacing NaN values with a quantile, which quantile in the non-nan tensor should be used.
nan_quantile = 0.05

# The astropy table to use as a catalog in LSSTDataSet and friends
astropy_table = false

# Semi width in degrees of cutouts made from the butler (17 arcsec)
semi_width_deg = 0.00472

# Semi height in degrees of cutouts made from the butler (17 arcsec)
semi_height_deg = 0.00472



[data_set.HyraxRandomDataset]
# Total number of samples produced by the random dataset
size = 100

# The dimensions of the numpy arrays that will be produced for each sample represented
# as a list where each element is the size of dimension.
shape = [2,5,5]

# Seed to use for random number generation
seed = 42

# If a list is provided, the data will have randomly labeled with values from the list
# If set to false, no labels will be included with the data.
provided_labels = [0, 1, 2]

# List of metadata field names. These will be populated with dummy data.
metadata_fields = ["meta_field_1", "meta_field_2"]

# Set this to a positive integer to randomly replace some values with an "invalid" value.
number_invalid_values = 0

# The value to use for invalid values in the data. Must be one of the following:
# "nan", "inf", "-inf", "none" or a float value.
invalid_value_type = "nan"


[data_loader]
# The number of data points to load at once.
batch_size = 512

# STRONG RECOMMENDATION: Leave this as `false`.
# Ensure that the data loader does no secondary shuffling of the data.
shuffle = false

# The collate function to use in the data loader. Using `false` will use PyTorch's default collate function,
# or specify an import path to a callable function i.e. "package.module.my_collate_fn"
collate_fn = false


[infer]
# The path to the model weights file to use for inference.
model_weights_file = false

# >>> I believe that we can simply remove this entry <<<<<<<<<<<<<<<<<<<<<<<<<<<
# The data_set split to use for inference. Use `false` for entire dataset.
split = "infer"


[vector_db]
# The type of vector db to use. Use "false" to disable vector database.
name = "chromadb"

# The directory where the vector database will be stored. Use "false" to create
# a new vector database in a timestamped directory. Otherwise set to a path.
vector_db_dir = false

# The path to inference results. Setting to "false" will use the most recent
# inference results.
infer_results_dir = false


[vector_db.chromadb]
# The approximate maximum size of a shard before creating a new one. A smaller
# value will decrease insert times while increasing search times.
shard_size_limit = 65536

# Inserting vectors with more than this many elements logs a warning message. ChromaDB
# performance degrades with vectors of this size. Set to "false" to disable warning.
vector_size_warning = 10000


[vector_db.qdrant]
# The number of elements in the vectors that will be stored in the vector database.
# This must be the same as the size of the vectors produced by the model.
vector_size = 64


[results]
# Path to inference results to use for visualization and lookups. Uses latest inference run if none provided.
inference_dir = false


[umap]
# Number of data points used to fit the umap transform.
fit_sample_size = 1024

# Save the fitted umap as a pickle file
save_fit_umap = true

# Use multiprocessing during transforming to umap space (More memory intensive)
parallel = false

# Name of the umap implementation to use
name = "umap.UMAP"


[umap.UMAP]
# Specify any parameter accepted by https://umap-learn.readthedocs.io/en/latest/api.html#umap
# Dimension of the embedded space
n_components = 2

# Controls how UMAP balances local versus global structure in the data.
# See official documentation for details.
n_neighbors = 15


[visualize]

# List of metadata field names to use in visualizer. Must be available as metadata in your dataset
fields = []

# Whether to display a panel of randomly chosen images corresponding to the selected points
display_images = false

# Name of catalog column to use for coloring points in the scatter plot. Use false for no coloring.
color_column = false

# Colormap to use for coloring points in the scatter plot when color_column is specified
cmap = "viridis"

# Only valid for .pt tensor images. Which bands should be loaded for display
# [0,3,5] would map bands in that order to R,G,B. Single band will be grayscale.
torch_tensor_bands = [3]

# Whether to rasterize plot. Will break coloring (Haloviews Bug)
# Helpful to reduce lag in large datasets.
rasterize_plot = false


[engine]

# The directory containing the ONNX model used for inference in production
model_directory = false
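The comments in the [data_set] table of the file above state that train_size, validate_size and test_size may be ratios or absolute sample counts, and that together they must not exceed the dataset. A rough sketch of that check follows. The function name check_split_sizes and the dataset size of 1000 are hypothetical; the sketch only covers the "must not exceed" rule and does not fill in values left as false, so Hyrax's real validation may well differ.

```python
def check_split_sizes(sizes: dict, n_samples: int) -> float:
    """Return the fraction of the dataset requested, or raise ValueError.

    `sizes` maps split names to a float ratio, an int sample count, or
    False (unset), mirroring the conventions in the config comments.
    """
    requested = 0.0
    for name, size in sizes.items():
        if size is False:
            continue  # unset splits are simply skipped in this sketch
        if isinstance(size, float):
            if not 0.0 <= size <= 1.0:
                raise ValueError(f"{name}: ratio must be between 0.0 and 1.0")
            requested += size
        else:
            requested += size / n_samples  # convert a sample count to a ratio
    if requested > 1.0 + 1e-9:  # small tolerance for float rounding
        raise ValueError(f"splits request {requested:.2f} of the dataset")
    return requested


# The default values from the [data_set] table use the whole dataset.
defaults = {"train_size": 0.6, "validate_size": 0.2, "test_size": 0.2}
check_split_sizes(defaults, n_samples=1000)

# Ratios summing to 1.2 are rejected.
try:
    check_split_sizes(
        {"train_size": 0.8, "validate_size": 0.2, "test_size": 0.2}, n_samples=1000
    )
    too_large = False
except ValueError:
    too_large = True
```

The same function accepts integer counts, for example train_size = 600 and validate_size = 200 on a 1000-sample dataset, which it treats as 80% of the data.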