Croissant supports a few simple transformations that can be applied to the source data. Transformations modify extracted data before it's included in the final dataset.
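{
  "fileSet": {
    "@id": "files"
  },
  "extract": {
    "fileProperty": "fullpath"
  },
  "transform": {
    "regex": "^(train|val|test)2014/.*\\.jpg$"
  }
}
This example extracts each file's full path and applies a regex that captures the split name (train, val, or test) from the leading directory. The full path is needed here because the pattern matches a directory component, which a bare filename does not contain.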
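To make the effect of the split regex concrete, here is a minimal Python sketch that applies the same pattern to a few path-like strings. It assumes the value of the first capture group is what ends up in the field; the sample paths and the helper name `extract_split` are hypothetical and not part of Croissant.

```python
import re

# Pattern from the example above, after JSON unescaping ("\\." becomes "\.").
SPLIT_PATTERN = re.compile(r"^(train|val|test)2014/.*\.jpg$")


def extract_split(path):
    """Return the first capture group if the path matches, else None.

    Hypothetical helper: taking the first group is an assumption about
    how the transform result is chosen.
    """
    match = SPLIT_PATTERN.match(path)
    return match.group(1) if match else None


# Hypothetical sample paths, chosen only to exercise the pattern.
paths = [
    "train2014/COCO_train2014_000000000009.jpg",
    "val2014/COCO_val2014_000000000042.jpg",
    "annotations/instances.json",
]
splits = [extract_split(p) for p in paths]
print(splits)  # ['train', 'val', None]
```

Paths that do not start with one of the three split directories, or that do not end in `.jpg`, produce no match in this sketch.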
{
  "source": {
    "fileSet": { "@id": "image-files" },
    "extract": {
      "fileProperty": "filename"
    },
    "transform": {
      "regex": "([^\\/]*)\\.jpg"
    }
  }
}
This extracts the base filename (without path and extension) from image files.
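The base-name regex can be tried out the same way. The sketch below is only an illustration: the JSON string `"([^\\/]*)\\.jpg"` decodes to the pattern `([^\/]*)\.jpg`, and the code assumes the pattern may match anywhere in the input (`re.search`) and that the first capture group is the result. The helper name `base_name` and the sample inputs are hypothetical.

```python
import re

# Decoded form of the JSON regex string "([^\\/]*)\\.jpg".
BASENAME_PATTERN = re.compile(r"([^\/]*)\.jpg")


def base_name(filename):
    """Return the text before ".jpg" with any directory prefix excluded.

    Assumption: the pattern is searched for anywhere in the string and
    the first capture group is returned.
    """
    match = BASENAME_PATTERN.search(filename)
    return match.group(1) if match else None


print(base_name("cat_001.jpg"))         # cat_001
print(base_name("images/cat_001.jpg"))  # cat_001
print(base_name("notes.txt"))           # None
```

Because `[^\/]*` cannot cross a slash, the captured group never includes a directory component, even when the input happens to carry one.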
{
  "transform": {
    "delimiter": ","
  }
}
This would split a comma-separated string into an array of values.
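In Python terms, the delimiter transform behaves roughly like `str.split`. The sketch below assumes a plain split with no whitespace trimming, which the text above does not specify; the helper name `split_on_delimiter` is hypothetical.

```python
def split_on_delimiter(value, delimiter=","):
    """Split a string on a delimiter and return the list of parts.

    Assumption: surrounding whitespace is kept as-is, since no trimming
    behaviour is described.
    """
    return value.split(delimiter)


print(split_on_delimiter("cat,dog,bird"))  # ['cat', 'dog', 'bird']
print(split_on_delimiter("a, b"))          # ['a', ' b']
```

If the source values contain spaces after the delimiter, the parts would need to be stripped in a later step under this assumption.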
{
  "transform": {
    "jsonQuery": "$.metadata.authors[*].name"
  }
}
This would extract all author names from a JSON structure using a JSON query.
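To show what the query `$.metadata.authors[*].name` selects, the following sketch walks an equivalent structure by hand using only the standard library. It is not a JSONPath implementation and handles this one query shape only; the sample document and the helper name `author_names` are hypothetical.

```python
import json

# Hypothetical sample document shaped to fit the query in the example above.
document = json.loads("""
{
  "metadata": {
    "authors": [
      {"name": "Ada Example", "affiliation": "Example Lab"},
      {"name": "Grace Sample"}
    ]
  }
}
""")


def author_names(doc):
    """Hand-written equivalent of $.metadata.authors[*].name.

    "$" is the document root, ".metadata.authors" descends two keys,
    "[*]" visits every array element, and ".name" picks one key from each.
    """
    authors = doc.get("metadata", {}).get("authors", [])
    return [author["name"] for author in authors if "name" in author]


print(author_names(document))  # ['Ada Example', 'Grace Sample']
```

Elements without a `name` key are skipped here, which mirrors how a wildcard path selects only the nodes that exist.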
Transformations are typically used within DataSource definitions, applied after data extraction but before final formatting. They provide a way to clean, parse, or restructure data to make it suitable for machine learning workflows without requiring external preprocessing steps.
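The ordering described here (extract first, then transform) can be sketched as a tiny two-step pipeline driven by a dictionary shaped like the snippets above. This is an illustrative sketch under stated assumptions, not how any Croissant library is implemented: the functions `extract` and `transform` are hypothetical, only two `fileProperty` values and two transform keys are handled, and returning the input unchanged when a regex does not match is an assumption.

```python
import re
from pathlib import PurePosixPath


def extract(path, file_property):
    """Extraction step: pull one property out of a file path (hypothetical)."""
    if file_property == "filename":
        return PurePosixPath(path).name
    if file_property == "fullpath":
        return path
    raise ValueError(f"unsupported fileProperty: {file_property}")


def transform(value, spec):
    """Transformation step: apply a regex or delimiter if the spec has one."""
    if "regex" in spec:
        match = re.search(spec["regex"], value)
        # Assumption: use the first capture group; fall back to the input.
        if match and match.groups():
            return match.group(1)
        return value
    if "delimiter" in spec:
        return value.split(spec["delimiter"])
    return value


# Dictionary mirroring the base-filename example; the path is hypothetical.
source = {
    "extract": {"fileProperty": "filename"},
    "transform": {"regex": r"([^\/]*)\.jpg"},
}

raw = extract("images/cat_001.jpg", source["extract"]["fileProperty"])
print(raw)                                  # cat_001.jpg
print(transform(raw, source["transform"]))  # cat_001
```

Keeping the two steps separate makes it easy to see which value the transform receives, which matters when a regex depends on directory components.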