Croissant

Namespace: http://mlcommons.org/croissant/

Classes

ContexExtractionEnumeration Class

URI: http://mlcommons.org/croissant/ContentExtractionEnumeration

Description: Specifies which content to extract from a file. One of "all", "lines", or "lineNumbers".

Subclass of:

DataSource Class

URI: http://mlcommons.org/croissant/DataSource

Description: A source of data, optionally transformed before being used.

Subclass of:

Properties:

Documentation

DataSource is the class describing the data that can be extracted from files to populate a RecordSet. This class should be used when the data coming from the source needs to be transformed or formatted to be included in the ML dataset; otherwise a simple Reference can be used instead to point to the source.

DataSource is a subclass of schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
fileObject Reference ONE The name of the referenced FileObject source of the data
fileSet Reference ONE The name of the referenced FileSet source of the data
recordSet Reference ONE The name of the referenced RecordSet source
extract Extract ONE The extraction method from the provided source
transform Transform MANY A transformation to apply on source data on top of the extracted method as specified through extract, e.g., a regular expression or JSON query
format Format ONE A format to parse the values of the data from text, e.g., a date format or number format

Usage

DataSource is used within Field definitions to specify where the data for the field comes from and how it should be processed. The source can be a FileObject, FileSet, or another RecordSet, and the data can be extracted and transformed using the extract, transform, and format properties.

Example

{
  "source": {
    "fileSet": { "@id": "image-files" },
    "extract": {
      "fileProperty": "filename"
    },
    "transform": {
      "regex": "([^\\/]*)\\.jpg"
    }
  }
}

This example extracts filenames from a set of image files and applies a regular expression transformation to extract just the base filename without the path and extension.

DataType Class

URI: http://mlcommons.org/croissant/DataType

Description: The data type of values expected for a Field in a RecordSet. This class is inspired by the Datatype class in CSVW. In addition to simple atomic types, types can be semantic types, such as schema.org classes, as well types defined in other vocabularies.

Subclass of:

Documentation

The data type of values expected for a Field in a RecordSet. This class is inspired by the Datatype class in CSVW. In addition to simple atomic types, types can be semantic types, such as schema.org classes, as well types defined in other vocabularies.

Key Features

  • A field may have more than a single assigned dataType, in which case at least one must be an atomic data type (e.g.: sc:Text), while other types can provide more semantic information, possibly in the context of ML.
  • Can be specified at two levels: on individual Fields and on entire RecordSets.

Atomic Data Types

dataType Usage
sc:Boolean Describes a boolean
sc:Date Describes a date
sc:Float Describes a float
sc:Integer Describes an integer
sc:Text Describes a string

ML-Specific Data Types

dataType Usage
sc:ImageObject Describes a field containing the content of an image (pixels)
cr:BoundingBox Describes the coordinates of a bounding box (4-number array)
cr:Split Describes a RecordSet used to divide data into multiple sets according to intended usage with regards to models

Using Data Types from Other Vocabularies

Croissant datasets can use data types from other vocabularies, such as Wikidata. These may be supported by the tools consuming the data, but don't need to. For example:

dataType Usage
wd:Q48277 (gender) Describes a Field or a RecordSet whose values are indicative of someone's gender

Examples

Simple Field Type

{
  "@id": "images/color_sample",
  "@type": "cr:Field",
  "dataType": "sc:ImageObject"
}

Multiple Data Types

{
  "@id": "cities/url",
  "@type": "cr:Field",
  "dataType": ["https://schema.org/URL", "https://www.wikidata.org/wiki/Q515"]
}

This example shows a field that is expected to be a URL, whose semantic type is City, so values will be URLs referring to cities.

Extract Class

URI: http://mlcommons.org/croissant/Extract

Description: Specifies how to extract data from a DataSource. The extraction mechanism depends on the type of content, e.g., a column name for tabular data, or a jsonPath for JSON data.

Subclass of:

Properties:

Documentation

Sometimes, not all the data from the source is needed, but only a subset. The Extract class can be used to specify how to do that, depending on the type of the data.

Extraction Methods

Source type Property Expected property value Result
FileObject or FileSet fileProperty One of: fullpath, filename, content, lines, lineNumbers The corresponding property for the FileObject
CSV (FileObject) column A column name Values in the specified column
JSON jsonPath A JSONPath expression The value(s) obtained by evaluating the JSON path expression

FileProperty Values

  • fullpath: The full path to the file within the Croissant extraction or download folders. Example: data/train/metadata.csv
  • filename: The name of the file. In data/train/metadata.csv, the file name is metadata.csv
  • content: The byte content of the file
  • lines: The byte content of each line in the file
  • lineNumbers: The number of each line in the file (starting from 0)

Examples

Extracting File Content

{
  "extract": {
    "fileProperty": "content"
  }
}

Extracting CSV Column

{
  "extract": {
    "column": "userId"
  }
}

Extracting with JSONPath

{
  "extract": {
    "jsonPath": "$.metadata.title"
  }
}

Extracting Filename

{
  "extract": {
    "fileProperty": "filename"
  }
}

This class is typically used within a DataSource to specify exactly what part of the source data should be extracted for a particular field.

Field Class

URI: http://mlcommons.org/croissant/Field

Description: A component of the structure of a RecordSet, such as a column of a table.

Subclass of:

Properties:

Documentation

A Field is part of a RecordSet. It may represent a column of a table, or a nested data structure or even a nested RecordSet in the case of hierarchical data.

Field is a subclass of schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
source DataSource
URL
ONE The data source of the field. This will generally reference a FileObject or FileSet's contents
dataType DataType MANY The data type of the field, identified by the URI of the corresponding class
repeated Boolean ONE If true, then the Field is a list of values of type dataType
equivalentProperty URL MANY A property that is equivalent to this Field
references Reference MANY Another Field of another RecordSet that this field references (foreign key equivalent)
subField Field MANY Another Field that is nested inside this one
parentField Reference MANY A special case of SubField that should be hidden because it references a Field that already appears in the RecordSet

Key Features

  • Each field has a name (unique identifier within the RecordSet)
  • Supports foreign key relationships through the references property
  • Supports hierarchical nesting with subField and parentField
  • Can specify multiple data types for semantic enrichment

Examples

Simple Field

{
  "@type": "cr:Field",
  "@id": "ratings/user_id",
  "dataType": "sc:Integer",
  "source": {
    "fileObject": { "@id": "ratings-table" },
    "extract": {
      "column": "userId"
    }
  }
}

Field with Reference (Foreign Key)

{
  "@type": "cr:Field",
  "@id": "ratings/movie_id",
  "dataType": "sc:Integer",
  "source": {
    "fileObject": { "@id": "ratings-table" },
    "extract": {
      "column": "movieId"
    }
  },
  "references": {
    "@id": "movies/movie_id"
  }
}

Nested Field with SubFields

{
  "@type": "cr:Field",
  "@id": "gps_coordinates",
  "description": "GPS coordinates where the image was taken.",
  "dataType": "sc:GeoCoordinates",
  "subField": [
    {
      "@type": "cr:Field",
      "@id": "gps_coordinates/latitude",
      "dataType": "sc:Float",
      "source": {
        "fileObject": { "@id": "metadata" },
        "extract": { "column": "latitude" }
      }
    },
    {
      "@type": "cr:Field",
      "@id": "gps_coordinates/longitude",
      "dataType": "sc:Float",
      "source": {
        "fileObject": { "@id": "metadata" },
        "extract": { "column": "longitude" }
      }
    }
  ]
}

This example shows how fields can be hierarchically structured to represent complex data types like geographical coordinates.

FileObject Class

URI: http://mlcommons.org/croissant/FileObject

Description: An individual file that is part of a dataset.

Subclass of:

Properties:

Documentation

FileObject is the Croissant class used to represent individual files that are part of a dataset.

FileObject is a general purpose class that inherits from Schema.org CreativeWork, and can be used to represent instances of more specific types of content like DigitalDocument and MediaObject.

Most of the important properties needed to describe a FileObject are defined in the classes it inherits from:

Property ExpectedType Cardinality Description
sc:name Text ONE The name of the file. As much as possible, the name should reflect the name of the file as downloaded, including the file extension. e.g. "images.zip".
sc:contentUrl URL ONE Actual bytes of the media object, for example the image file or video file.
sc:contentSize Text ONE File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified.
sc:encodingFormat Text ONE The format of the file, given as a mime type.
sc:sameAs URL MANY URL (or local name) of a FileObject with the same content, but in a different format.
sc:sha256 Text ONE Checksum for the file contents.

In addition, FileObject defines the following property:

Property ExpectedType Cardinality Description
containedIn Text MANY Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object.

Let's look at a few examples of FileObject definitions.

First, a single CSV file:

{
  "@type": "cr:FileObject",
  "@id": "pass_metadata.csv",
  "contentUrl": "https://zenodo.org/record/6615455/files/pass_metadata.csv",
  "encodingFormat": "text/csv",
  "sha256": "0b033707ea49365a5ffdd14615825511"
}

Next: An archive and some files extracted from it (represented via the containedIn property):

{
  "@type": "cr:FileObject",
  "@id": "ml-25m.zip",
  "contentUrl": "https://files.grouplens.org/datasets/movielens/ml-25m.zip",
  "encodingFormat": "application/zip",
  "sha256": "6b51fb2759a8657d3bfcbfc42b592ada"
},
{
  "@type": "cr:FileObject",
  "@id": "ratings-table",
  "contentUrl": "ratings.csv",
  "containedIn": { "@id": "ml-25m.zip" },
  "encodingFormat": "text/csv"
},
{
  "@type": "cr:FileObject",
  "@id": "movies-table",
  "contentUrl": "movies.csv",
  "containedIn": { "@id": "ml-25m.zip" },
  "encodingFormat": "text/csv"
}

FilePropertyEnumeration Class

URI: http://mlcommons.org/croissant/FilePropertyEnumeration

Description: Specifies a property of a FileObject. One of "fullPath" or "fileName".

Subclass of:

FileSet Class

URI: http://mlcommons.org/croissant/FileSet

Description: A set of homogeneous files extracted from a container, optionally filtered by inclusion and/or exclusion filters.

Subclass of:

Properties:

Documentation

In many datasets, data comes in the form of collections of homogeneous files, such as images, videos or text files, where each file needs to be treated as an individual item, e.g., as a training example. FileSet is a class that describes such collections of files.

A FileSet is a set of files located in a container, which can be an archive FileObject or a "manifest" file. A FileSet may also specify inclusion / exclusion filters using file patterns.

FileSet extends schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
containedIn Reference MANY The source of data for the FileSet, e.g., an archive. If multiple values are provided, then the union of their contents is taken
includes Text MANY A glob pattern that specifies the files to include
excludes Text MANY A glob pattern that specifies the files to exclude

Pattern Processing

The includes and excludes properties use glob patterns, a common mechanism to specify a set of files along a path, like ".jpg" for all jpg images, or "/foo/pic.jpg" for all jpg images under the "foo" directory whose filename starts with "pic".

To get the set of FileObjects included in the FileSet: 1. The includes pattern(s) are evaluated first 2. If multiple includes are specified, the union of their results is taken 3. Then all the files corresponding to the excludes patterns are removed from that set 4. Patterns are evaluated from the root of the containedIn contents (e.g., the top level directory extracted from an archive)

Examples

Simple Image Archive

{
  "@type": "cr:FileObject",
  "@id": "train2014.zip",
  "contentSize": "13510573713 B",
  "contentUrl": "http://images.cocodataset.org/zips/train2014.zip",
  "encodingFormat": "application/zip",
  "sha256": "sha256"
},
{
  "@type": "cr:FileSet",
  "@id": "image-files",
  "containedIn": { "@id": "train2014.zip" },
  "encodingFormat": "image/jpeg",
  "includes": "*.jpg"
}

Complex Archive with Multiple FileSets

{
  "@type": "cr:FileObject",
  "@id": "flores200_dataset.tar.gz",
  "description": "Flores 200 is hosted on a webserver.",
  "contentSize": "25585843 B",
  "contentUrl": "https://tinyurl.com/flores200dataset",
  "encodingFormat": "application/x-gzip",
  "sha256": "c764ffdeee4894b3002337c5b1e70ecf6f514c00"
},
{
  "@type": "cr:FileSet",
  "@id": "files-dev",
  "description": "dev files are inside the tar.",
  "containedIn": { "@id": "flores200_dataset.tar.gz" },
  "encodingFormat": "application/json",
  "includes": "flores200_dataset/dev/*.dev"
},
{
  "@type": "cr:FileSet",
  "@id": "files-devtest",
  "description": "devtest files are inside the tar.",
  "containedIn": { "@id": "flores200_dataset.tar.gz" },
  "encodingFormat": "application/json",
  "includes": "flores200_dataset/devtest/*.devtest"
}

This example shows how multiple FileSets can be extracted from a single archive, each with different inclusion patterns to select different subsets of files.

Format Class

URI: http://mlcommons.org/croissant/Format

Description: Specifies how to parse the format of the data from a string representation. For example, format may hold a date format string, a number format, or a bounding box format.

Subclass of:

Documentation

A format string used to parse the values coming from a DataSource. For example, a date may be represented as the string "2022/11/10", and interpreted into the correct date via the format "yyyy/MM/dd". Formats correspond to a target data type.

Supported Format Types

Data types Format Example
sc:Date
sc:DateTime
CLDR Date/Time Patterns MM/dd/yyyy
sc:Number
sc:Float
sc:Integer
CLDR Number and Currency patterns 0.##E0 (scientific notation with max 2 decimals)
cr:BoundingBox Keras bounding box format CENTER_XYWH

Note: This list is not exhaustive, and not all Croissant implementations will support all formats.

Examples

Date Format Parsing

{
  "source": {
    "fileObject": { "@id": "metadata" },
    "extract": { "column": "datetaken" },
    "format": "%Y-%m-%d %H:%M:%S.%f"
  }
}

Bounding Box Format

{
  "@type": "cr:Field",
  "@id": "images/annotations/bbox",
  "description": "The bounding box around annotated object[s].",
  "dataType": "cr:BoundingBox",
  "source": {
    "fileSet": { "@id": "instancesperson_keypoints_annotations" },
    "extract": { "column": "bbox" },
    "format": "CENTER_XYWH"
  }
}

Usage

Format specifications are typically used within DataSource definitions to ensure that string representations of structured data (like dates, numbers, or coordinates) are correctly parsed into their intended data types. This is particularly important for ML datasets where precise data interpretation is crucial for model training and evaluation.

RecordSet Class

URI: http://mlcommons.org/croissant/RecordSet

Description: A description of a set of structured records from one or more data sources and their structure, expressed as a set of fields.

Subclass of:

Properties:

  • data (→ 22-rdf-syntax-ns#JSON)
  • dataType (→ DataType)
  • examples (→ 22-rdf-syntax-ns#JSON)
  • field (→ Field)
  • key (→ Field)
  • source (→ DataSource, FileObject, FileSet, RecordSet)

Documentation

A RecordSet describes a set of structured records obtained from one or more data sources (typically a file or set of files) and the structure of these records, expressed as a set of fields (e.g., the columns of a table). A RecordSet can represent flat or nested data.

Purpose

RecordSet provides a common structure description that can be used across different modalities, in terms of records that may contain multiple fields. It handles:

  • Unstructured content (like text and images) as single-field records
  • Tabular data as one record per row in the table, with fields for each column
  • Tree-structured data with nested and repeated fields

RecordSet is a subclass of schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
field Field MANY A data element that appears in the records of the RecordSet (e.g., one column of a table)
key Text MANY One or more fields whose values uniquely identify each record in the RecordSet
data JSON MANY One or more records that constitute the data of the RecordSet
examples JSON
URL
MANY One or more records provided as example content of the RecordSet, or a reference to data source that contains examples

Additional Features

  • Embedding: Supports embedding small enumerations directly via the data property
  • Typing: Supports typing with dataType for entire RecordSets
  • Joins: Supports joins through field references (foreign keys)
  • Hierarchical: Supports hierarchical structures with nested records

Examples

Simple Tabular RecordSet

{
  "@type": "cr:RecordSet",
  "@id": "ratings",
  "key": [{ "@id": "ratings/user_id" }, { "@id": "ratings/movie_id" }],
  "field": [
    {
      "@type": "cr:Field",
      "@id": "ratings/user_id",
      "dataType": "sc:Integer",
      "source": {
        "fileObject": { "@id": "ratings-table" },
        "extract": { "column": "userId" }
      }
    },
    {
      "@type": "cr:Field",
      "@id": "ratings/rating",
      "description": "The score of the rating on a five-star scale.",
      "dataType": "sc:Float",
      "source": {
        "fileObject": { "@id": "ratings-table" },
        "extract": { "column": "rating" }
      }
    }
  ]
}

Enumeration with Embedded Data

{
  "@type": "cr:RecordSet",
  "@id": "gender_enum",
  "description": "Maps gender ids (0, 1) to labeled values.",
  "key": { "@id": "gender_enum/id" },
  "field": [
    { "@id": "gender_enum/id", "@type": "cr:Field", "dataType": "sc:Integer" },
    { "@id": "gender_enum/label", "@type": "cr:Field", "dataType": "sc:String" }
  ],
  "data": [
    { "gender_enum/id": 0, "gender_enum/label": "Male" },
    { "gender_enum/id": 1, "gender_enum/label": "Female" }
  ]
}

Geographic Data with Type Mapping

{
  "@id": "cities",
  "@type": "cr:RecordSet",
  "dataType": "sc:GeoCoordinates",
  "field": [
    {
      "@id": "cities/latitude",
      "@type": "cr:Field"
    },
    {
      "@id": "cities/longitude", 
      "@type": "cr:Field"
    }
  ]
}

This example shows how RecordSets can be typed with semantic types like sc:GeoCoordinates, and fields can be implicitly mapped to properties of that type (latitude and longitude).

Transform Class

URI: http://mlcommons.org/croissant/Transform

Description: Specifies how to transform data extracted from a DataSource. The type of transformation depends on the type of content, e.g., a regular expression to appy on text, or a jsonQuery to transform JSON content.

Subclass of:

Properties:

Documentation

Croissant supports a few simple transformations that can be applied on the source data. Transformations are used to modify extracted data before it's included in the final dataset.

Supported Transformations

  • delimiter: Split a string into an array using the supplied character
  • regex: A regular expression to parse the data
  • jsonQuery: A JSON query to evaluate on the (JSON) data source

Examples

Regular Expression Transformation

{
  "fileSet": {
    "@id": "files"
  },
  "extract": {
    "fileProperty": "filename"
  },
  "transform": {
    "regex": "^(train|val|test)2014/.*\\.jpg$"
  }
}

This example extracts filenames and applies a regex to parse training/validation/test split information.

Filename Parsing

{
  "source": {
    "fileSet": { "@id": "image-files" },
    "extract": {
      "fileProperty": "filename"
    },
    "transform": {
      "regex": "([^\\/]*)\\.jpg"
    }
  }
}

This extracts the base filename (without path and extension) from image files.

Delimiter Transformation

{
  "transform": {
    "delimiter": ","
  }
}

This would split a comma-separated string into an array of values.

JSON Query Transformation

{
  "transform": {
    "jsonQuery": "$.metadata.authors[*].name"
  }
}

This would extract all author names from a JSON structure using a JSON query.

Usage

Transformations are typically used within DataSource definitions, applied after data extraction but before final formatting. They provide a way to clean, parse, or restructure data to make it suitable for machine learning workflows without requiring external preprocessing steps.

Properties

citeAs Property

URI: http://mlcommons.org/croissant/citeAs

Description: How to cite this dataset. Ideally, citations should be expressed using the bibtex format. Note that this is different from schema.org/citation, which is used to make a citation to another publication from this dataset.

Domain:

Range:

column Property

URI: http://mlcommons.org/croissant/column

Description: In case the data source is tabular, the id of a column to extract.

Domain:

Range:

containedIn Property

URI: http://mlcommons.org/croissant/containedIn

Description: Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object.

Domain:

Range:

content Property

URI: http://mlcommons.org/croissant/content

Description: What to extract from the data source content, e.g., lines.

Domain:

Range:

data Property

URI: http://mlcommons.org/croissant/data

Description: One or more inlined records that constitute the data of the RecordSet, typically used for small enumeration values.

Domain:

Range:

dataType Property

URI: http://mlcommons.org/croissant/dataType

Description: The data type of the field, identified by the URI of the corresponding class. It could be either an atomic type (e.g, sc:Integer) or a semantic type (e.g., sc:GeoLocation).

Domain:

Range:

delimiter Property

URI: http://mlcommons.org/croissant/delimiter

Description: A delimiter to use parse the data into an array.

Domain:

Range:

equivalentProperty Property

URI: http://mlcommons.org/croissant/equivalentProperty

Description: A property that is equivalent to this Field. Used in the case a dataType is specified on the RecordSet to map specific fields to specific properties associated with that dataType.

Domain:

Range:

examples Property

URI: http://mlcommons.org/croissant/examples

Description: One more inlined records provided as example content of the RecordSet.

Domain:

Range:

excludes Property

URI: http://mlcommons.org/croissant/excludes

Description: A glob pattern that specifies the files to exclude. The pattern is evaluated from the root of the containedIn contents, after the includes patterns have been evaluated.

Domain:

Range:

extract Property

URI: http://mlcommons.org/croissant/extract

Description: The extraction method from the provided source.

Domain:

Range:

field Property

URI: http://mlcommons.org/croissant/field

Description: A data element that appears in the records of the RecordSet (e.g., one column of a table).

Domain:

Range:

fileObject Property

URI: http://mlcommons.org/croissant/fileObject

Description: The id of a FileObject that is the source of the data.

Domain:

Range:

fileProperty Property

URI: http://mlcommons.org/croissant/fileProperty

Description: The file property to extract from the data source metadata, e.g., the filename.

Domain:

Range:

fileSet Property

URI: http://mlcommons.org/croissant/fileSet

Description: The id of a FileSet that is the source of the data.

Domain:

Range:

format Property

URI: http://mlcommons.org/croissant/format

Description: A format to parse the values of the data from text, e.g., a date format or number format.

Domain:

Range:

includes Property

URI: http://mlcommons.org/croissant/includes

Description: A glob pattern that specifies the files to include, e.g., ".jpg", "/foo/pic*.jpg". The pattern is evaluated from the root of the containedIn contents.

Domain:

Range:

isLiveDataset Property

URI: http://mlcommons.org/croissant/isLiveDataset

Description: Indicates that the dataset is continuously updated instead of being versioned.

Domain:

Range:

jsonPath Property

URI: http://mlcommons.org/croissant/jsonPath

Description: In case the data source is JSON data, a path expression to extract a subset of the data.

Domain:

Range:

jsonQuery Property

URI: http://mlcommons.org/croissant/jsonQuery

Description: For JSON content, a query to evaluate on the data.

Domain:

Range:

key Property

URI: http://mlcommons.org/croissant/key

Description: One or more fields whose values uniquely identify each record in the RecordSet. (See example below.)

Domain:

Range:

parentField Property

URI: http://mlcommons.org/croissant/parentField

Description: A special case of SubField that should be hidden because it references a Field that already appears in the RecordSet.

Domain:

Range:

recordSet Property

URI: http://mlcommons.org/croissant/recordSet

Description: The id of a RecordSet that is the source of the data.

Domain:

Range:

references Property

URI: http://mlcommons.org/croissant/references

Description: Another Field of another RecordSet that this field references. This is the equivalent of a foreign key reference in a relational database.

Domain:

Range:

regex Property

URI: http://mlcommons.org/croissant/regex

Description: A regular expression to apply to the data.

Domain:

Range:

repeated Property

URI: http://mlcommons.org/croissant/repeated

Description: If true, then the Field is a list of values of type dataType.

Domain:

Range:

source Property

URI: http://mlcommons.org/croissant/source

Description: The data source of the field. This will generally reference a FileObject or FileSet's contents (e.g., a specific column of a table).

Domain:

Range:

subField Property

URI: http://mlcommons.org/croissant/subField

Description: Another Field that is nested inside this one.

Domain:

Range:

transform Property

URI: http://mlcommons.org/croissant/transform

Description: A transformation to apply on source data on top of the extracted method as specified through extract, e.g., a regular expression or JSON query.

Domain:

Range: