turtle-matter

Croissant

Namespace: http://mlcommons.org/croissant/

Classes (11)

ContentExtractionEnumeration

URI: http://mlcommons.org/croissant/ContentExtractionEnumeration

Description: Specifies which content to extract from a file. One of “all”, “lines”, or “lineNumbers”.

Subclass of:


DataSource

URI: http://mlcommons.org/croissant/DataSource

Description: A source of data, optionally transformed before being used.

Subclass of:

Properties:

Documentation:

DataSource is the class describing the data that can be extracted from files to populate a RecordSet. This class should be used when the data coming from the source needs to be transformed or formatted to be included in the ML dataset; otherwise a simple Reference can be used instead to point to the source.

DataSource is a subclass of schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
fileObject Reference ONE The name of the referenced FileObject source of the data
fileSet Reference ONE The name of the referenced FileSet source of the data
recordSet Reference ONE The name of the referenced RecordSet source
extract Extract ONE The extraction method from the provided source
transform Transform MANY A transformation to apply to the source data after the extraction specified through extract, e.g., a regular expression or JSON query
format Format ONE A format to parse the values of the data from text, e.g., a date format or number format

Usage

DataSource is used within Field definitions to specify where the data for the field comes from and how it should be processed. The source can be a FileObject, FileSet, or another RecordSet, and the data can be extracted and transformed using the extract, transform, and format properties.

Example

{
  "source": {
    "fileSet": { "@id": "image-files" },
    "extract": {
      "fileProperty": "filename"
    },
    "transform": {
      "regex": "([^\\/]*)\\.jpg"
    }
  }
}

This example extracts filenames from a set of image files and applies a regular expression transformation to extract just the base filename without the path and extension.
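
As a rough illustration of what a consumer might do with this source, the Python sketch below applies the regular expression to a filename and keeps the first capture group. The helper name base_name and the sample path are hypothetical, and taking the first capture group as the result is an assumption rather than something stated here.

```python
import re
from typing import Optional

# The "regex" value from the example above, after JSON unescaping.
FILENAME_REGEX = re.compile(r"([^\/]*)\.jpg")


def base_name(filename: str) -> Optional[str]:
    """Return the first capture group, or None when the regex does not match."""
    match = FILENAME_REGEX.search(filename)
    return match.group(1) if match else None


# Hypothetical input; a path prefix is included to show that it is dropped.
print(base_name("train2014/img_001.jpg"))
```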


DataType

URI: http://mlcommons.org/croissant/DataType

Description: The data type of values expected for a Field in a RecordSet. This class is inspired by the Datatype class in CSVW. In addition to simple atomic types, types can be semantic types, such as schema.org classes, as well as types defined in other vocabularies.

Subclass of:

Documentation:

The data type of values expected for a Field in a RecordSet. This class is inspired by the Datatype class in CSVW. In addition to simple atomic types, types can be semantic types, such as schema.org classes, as well as types defined in other vocabularies.

Key Features

Atomic Data Types

dataType Usage
sc:Boolean Describes a boolean
sc:Date Describes a date
sc:Float Describes a float
sc:Integer Describes an integer
sc:Text Describes a string
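
One way a consumer could act on these atomic types is to map each dataType to a parser for raw text values. The sketch below is only an illustration: the PARSERS table, the accepted boolean spellings, and the choice of ISO 8601 dates are all assumptions, not part of the vocabulary.

```python
from datetime import date

# Hypothetical mapping from atomic dataType names to Python parsers.
PARSERS = {
    "sc:Boolean": lambda raw: raw.strip().lower() in ("true", "1"),  # assumed spellings
    "sc:Date": date.fromisoformat,  # assumes ISO 8601 input such as 2022-11-10
    "sc:Float": float,
    "sc:Integer": int,
    "sc:Text": str,
}


def parse_value(raw: str, data_type: str):
    """Convert a raw string into a Python value for the given atomic dataType."""
    if data_type not in PARSERS:
        raise ValueError(f"no parser registered for {data_type}")
    return PARSERS[data_type](raw)


print(parse_value("42", "sc:Integer"), parse_value("2022-11-10", "sc:Date"))
```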

ML-Specific Data Types

dataType Usage
sc:ImageObject Describes a field containing the content of an image (pixels)
cr:BoundingBox Describes the coordinates of a bounding box (4-number array)
cr:Split Describes a RecordSet used to divide data into multiple sets according to intended usage with regards to models

Using Data Types from Other Vocabularies

Croissant datasets can use data types from other vocabularies, such as Wikidata. Tools consuming the data may support these types, but are not required to. For example:

dataType Usage
wd:Q48277 (gender) Describes a Field or a RecordSet whose values are indicative of someone’s gender

Examples

Simple Field Type

{
  "@id": "images/color_sample",
  "@type": "cr:Field",
  "dataType": "sc:ImageObject"
}

Multiple Data Types

{
  "@id": "cities/url",
  "@type": "cr:Field",
  "dataType": ["https://schema.org/URL", "https://www.wikidata.org/wiki/Q515"]
}

This example shows a field that is expected to be a URL, whose semantic type is City, so values will be URLs referring to cities.


Extract

URI: http://mlcommons.org/croissant/Extract

Description: Specifies how to extract data from a DataSource. The extraction mechanism depends on the type of content, e.g., a column name for tabular data, or a jsonPath for JSON data.

Subclass of:

Properties:

Documentation:

Sometimes, not all the data from the source is needed, but only a subset. The Extract class can be used to specify how to do that, depending on the type of the data.

Extraction Methods

Source type Property Expected property value Result
FileObject or FileSet fileProperty One of: fullpath, filename, content, lines, lineNumbers The corresponding property for the FileObject
CSV (FileObject) column A column name Values in the specified column
JSON jsonPath A JSONPath expression The value(s) obtained by evaluating the JSON path expression

FileProperty Values

Examples

Extracting File Content

{
  "extract": {
    "fileProperty": "content"
  }
}

Extracting CSV Column

{
  "extract": {
    "column": "userId"
  }
}

Extracting with JSONPath

{
  "extract": {
    "jsonPath": "$.metadata.title"
  }
}

Extracting Filename

{
  "extract": {
    "fileProperty": "filename"
  }
}

This class is typically used within a DataSource to specify exactly what part of the source data should be extracted for a particular field.
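
To make the extraction methods more concrete, here is a minimal Python sketch of two of them: column extraction from CSV text and a subset of fileProperty values. The function names are hypothetical, files are passed in as plain strings for simplicity, and lineNumbers and jsonPath are deliberately left out because their exact semantics are not described here.

```python
import csv
import io
from typing import List, Union


def extract_column(csv_text: str, column: str) -> List[str]:
    """Return the values of one named column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


def extract_file_property(
    path: str, content: str, file_property: str
) -> Union[str, List[str]]:
    """Return one file-level property; only a subset of values is sketched."""
    if file_property == "fullpath":
        return path
    if file_property == "filename":
        return path.rsplit("/", 1)[-1]
    if file_property == "content":
        return content
    if file_property == "lines":
        return content.splitlines()
    raise ValueError(f"unsupported fileProperty: {file_property}")


# Hypothetical sample data.
print(extract_column("userId,rating\n1,4.5\n2,3.0\n", "userId"))
print(extract_file_property("images/cat.jpg", "", "filename"))
```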


Field

URI: http://mlcommons.org/croissant/Field

Description: A component of the structure of a RecordSet, such as a column of a table.

Subclass of:

Properties:

Documentation:

A Field is part of a RecordSet. It may represent a column of a table, a nested data structure, or even a nested RecordSet in the case of hierarchical data.

Field is a subclass of schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
source DataSource or URL ONE The data source of the field. This will generally reference a FileObject or FileSet’s contents
dataType DataType MANY The data type of the field, identified by the URI of the corresponding class
repeated Boolean ONE If true, then the Field is a list of values of type dataType
equivalentProperty URL MANY A property that is equivalent to this Field
references Reference MANY Another Field of another RecordSet that this field references (foreign key equivalent)
subField Field MANY Another Field that is nested inside this one
parentField Reference MANY A special case of SubField that should be hidden because it references a Field that already appears in the RecordSet

Key Features

Examples

Simple Field

{
  "@type": "cr:Field",
  "@id": "ratings/user_id",
  "dataType": "sc:Integer",
  "source": {
    "fileObject": { "@id": "ratings-table" },
    "extract": {
      "column": "userId"
    }
  }
}

Field with Reference (Foreign Key)

{
  "@type": "cr:Field",
  "@id": "ratings/movie_id",
  "dataType": "sc:Integer",
  "source": {
    "fileObject": { "@id": "ratings-table" },
    "extract": {
      "column": "movieId"
    }
  },
  "references": {
    "@id": "movies/movie_id"
  }
}
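
Because references behaves like a foreign key, a consumer could resolve it with an ordinary lookup join. The sketch below assumes records are already loaded as Python dictionaries keyed by field id; the helper name, the sample rows, and the "movies/title" field are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def resolve_references(
    records: List[Record],
    source_field: str,
    target_records: List[Record],
    target_field: str,
) -> List[Tuple[Record, Optional[Record]]]:
    """Pair each record with the target record its reference points to."""
    index = {record[target_field]: record for record in target_records}
    # Unresolved references are paired with None rather than dropped.
    return [(record, index.get(record[source_field])) for record in records]


# Hypothetical sample rows.
ratings = [
    {"ratings/movie_id": 1, "ratings/rating": 4.5},
    {"ratings/movie_id": 3, "ratings/rating": 2.0},
]
movies = [{"movies/movie_id": 1, "movies/title": "Example Movie"}]

pairs = resolve_references(ratings, "ratings/movie_id", movies, "movies/movie_id")
print(pairs)
```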

Nested Field with SubFields

{
  "@type": "cr:Field",
  "@id": "gps_coordinates",
  "description": "GPS coordinates where the image was taken.",
  "dataType": "sc:GeoCoordinates",
  "subField": [
    {
      "@type": "cr:Field",
      "@id": "gps_coordinates/latitude",
      "dataType": "sc:Float",
      "source": {
        "fileObject": { "@id": "metadata" },
        "extract": { "column": "latitude" }
      }
    },
    {
      "@type": "cr:Field",
      "@id": "gps_coordinates/longitude",
      "dataType": "sc:Float",
      "source": {
        "fileObject": { "@id": "metadata" },
        "extract": { "column": "longitude" }
      }
    }
  ]
}

This example shows how fields can be hierarchically structured to represent complex data types like geographical coordinates.
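
A loader might materialize such nested fields as nested dictionaries. The following sketch assumes, purely for illustration, that the "/" in a field id mirrors the nesting; that convention, the helper name, and the sample values are assumptions.

```python
from typing import Any, Dict


def nest_record(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {"a/b": 1} style keys into nested dictionaries {"a": {"b": 1}}."""
    nested: Dict[str, Any] = {}
    for field_id, value in flat.items():
        parts = field_id.split("/")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# Hypothetical values for the two subfields in the example above.
flat_row = {"gps_coordinates/latitude": 48.85, "gps_coordinates/longitude": 2.35}
print(nest_record(flat_row))
```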


FileObject

URI: http://mlcommons.org/croissant/FileObject

Description: An individual file that is part of a dataset.

Subclass of:

Properties:

Documentation:

FileObject is the Croissant class used to represent individual files that are part of a dataset.

FileObject is a general purpose class that inherits from Schema.org CreativeWork, and can be used to represent instances of more specific types of content like DigitalDocument and MediaObject.

Most of the important properties needed to describe a FileObject are defined in the classes it inherits from:

Property ExpectedType Cardinality Description
sc:name Text ONE The name of the file. As much as possible, the name should reflect the name of the file as downloaded, including the file extension. e.g. "images.zip".
sc:contentUrl URL ONE Actual bytes of the media object, for example the image file or video file.
sc:contentSize Text ONE File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified.
sc:encodingFormat Text ONE The format of the file, given as a mime type.
sc:sameAs URL MANY URL (or local name) of a FileObject with the same content, but in a different format.
sc:sha256 Text ONE Checksum for the file contents.

In addition, FileObject defines the following property:

Property ExpectedType Cardinality Description
containedIn Text MANY Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object.

Let’s look at a few examples of FileObject definitions.

First, a single CSV file:

{
  "@type": "cr:FileObject",
  "@id": "pass_metadata.csv",
  "contentUrl": "https://zenodo.org/record/6615455/files/pass_metadata.csv",
  "encodingFormat": "text/csv",
  "sha256": "0b033707ea49365a5ffdd14615825511"
}
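
The sha256 property lets a consumer check that downloaded bytes match what the dataset author published. A minimal sketch of that check in Python follows; the function name is hypothetical and the bytes are a stand-in for real file contents.

```python
import hashlib


def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True when the SHA-256 digest of data equals the expected hex string."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


# Stand-in content; a real check would hash the bytes fetched from contentUrl.
payload = b"id,label\n1,cat\n"
print(verify_sha256(payload, hashlib.sha256(payload).hexdigest()))
```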

Next, an archive and some files extracted from it (represented via the containedIn property):

{
  "@type": "cr:FileObject",
  "@id": "ml-25m.zip",
  "contentUrl": "https://files.grouplens.org/datasets/movielens/ml-25m.zip",
  "encodingFormat": "application/zip",
  "sha256": "6b51fb2759a8657d3bfcbfc42b592ada"
},
{
  "@type": "cr:FileObject",
  "@id": "ratings-table",
  "contentUrl": "ratings.csv",
  "containedIn": { "@id": "ml-25m.zip" },
  "encodingFormat": "text/csv"
},
{
  "@type": "cr:FileObject",
  "@id": "movies-table",
  "contentUrl": "movies.csv",
  "containedIn": { "@id": "ml-25m.zip" },
  "encodingFormat": "text/csv"
}
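
When containedIn is present, contentUrl is read as a path inside the container. The sketch below shows one way this could look for a zip archive, using an in-memory archive so that it runs without network access. The helper names and the archive contents are hypothetical.

```python
import io
import zipfile


def make_demo_archive() -> bytes:
    """Build a small in-memory zip that stands in for a downloaded archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ratings.csv", "userId,movieId,rating\n1,10,5.0\n")
        archive.writestr("movies.csv", "movieId,title\n10,Example Movie\n")
    return buffer.getvalue()


def read_contained_file(archive_bytes: bytes, content_url: str) -> bytes:
    """Resolve contentUrl as a relative path within the container archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(content_url)


print(read_contained_file(make_demo_archive(), "ratings.csv").decode())
```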

FilePropertyEnumeration

URI: http://mlcommons.org/croissant/FilePropertyEnumeration

Description: Specifies a property of a FileObject. One of “fullPath” or “fileName”.

Subclass of:


FileSet

URI: http://mlcommons.org/croissant/FileSet

Description: A set of homogeneous files extracted from a container, optionally filtered by inclusion and/or exclusion filters.

Subclass of:

Properties:

Documentation:

In many datasets, data comes in the form of collections of homogeneous files, such as images, videos or text files, where each file needs to be treated as an individual item, e.g., as a training example. FileSet is a class that describes such collections of files.

A FileSet is a set of files located in a container, which can be an archive FileObject or a “manifest” file. A FileSet may also specify inclusion / exclusion filters using file patterns.

FileSet extends schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
containedIn Reference MANY The source of data for the FileSet, e.g., an archive. If multiple values are provided, then the union of their contents is taken
includes Text MANY A glob pattern that specifies the files to include
excludes Text MANY A glob pattern that specifies the files to exclude

Pattern Processing

The includes and excludes properties use glob patterns, a common mechanism to specify a set of files along a path, like “*.jpg” for all jpg images, or “*/foo/pic*.jpg” for all jpg images under the “foo” directory whose filename starts with “pic”.

To get the set of FileObjects included in the FileSet:

  1. The includes pattern(s) are evaluated first
  2. If multiple includes are specified, the union of their results is taken
  3. Then all the files corresponding to the excludes patterns are removed from that set
  4. Patterns are evaluated from the root of the containedIn contents (e.g., the top level directory extracted from an archive)
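
The steps above can be sketched in a few lines of Python. This version uses fnmatchcase from the standard library as a stand-in glob matcher, which is an assumption: its "*" also matches "/" characters, so it may be more permissive than the glob dialect a given Croissant implementation uses. The function name and sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Sequence


def select_files(
    paths: Iterable[str], includes: Sequence[str], excludes: Sequence[str] = ()
) -> List[str]:
    """Apply includes (union) first, then remove anything matching excludes."""
    included = [
        path for path in paths
        if any(fnmatchcase(path, pattern) for pattern in includes)
    ]
    return [
        path for path in included
        if not any(fnmatchcase(path, pattern) for pattern in excludes)
    ]


# Paths are relative to the root of the containedIn contents.
paths = ["a.jpg", "b.png", "foo/pic1.jpg", "foo/other.jpg"]
print(select_files(paths, ["*.jpg"], ["*/other.jpg"]))
```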

Examples

Simple Image Archive

{
  "@type": "cr:FileObject",
  "@id": "train2014.zip",
  "contentSize": "13510573713 B",
  "contentUrl": "http://images.cocodataset.org/zips/train2014.zip",
  "encodingFormat": "application/zip",
  "sha256": "sha256"
},
{
  "@type": "cr:FileSet",
  "@id": "image-files",
  "containedIn": { "@id": "train2014.zip" },
  "encodingFormat": "image/jpeg",
  "includes": "*.jpg"
}

Complex Archive with Multiple FileSets

{
  "@type": "cr:FileObject",
  "@id": "flores200_dataset.tar.gz",
  "description": "Flores 200 is hosted on a webserver.",
  "contentSize": "25585843 B",
  "contentUrl": "https://tinyurl.com/flores200dataset",
  "encodingFormat": "application/x-gzip",
  "sha256": "c764ffdeee4894b3002337c5b1e70ecf6f514c00"
},
{
  "@type": "cr:FileSet",
  "@id": "files-dev",
  "description": "dev files are inside the tar.",
  "containedIn": { "@id": "flores200_dataset.tar.gz" },
  "encodingFormat": "application/json",
  "includes": "flores200_dataset/dev/*.dev"
},
{
  "@type": "cr:FileSet",
  "@id": "files-devtest",
  "description": "devtest files are inside the tar.",
  "containedIn": { "@id": "flores200_dataset.tar.gz" },
  "encodingFormat": "application/json",
  "includes": "flores200_dataset/devtest/*.devtest"
}

This example shows how multiple FileSets can be extracted from a single archive, each with different inclusion patterns to select different subsets of files.


Format

URI: http://mlcommons.org/croissant/Format

Description: Specifies how to parse the format of the data from a string representation. For example, format may hold a date format string, a number format, or a bounding box format.

Subclass of:

Documentation:

A format string used to parse the values coming from a DataSource. For example, a date may be represented as the string “2022/11/10”, and interpreted into the correct date via the format “yyyy/MM/dd”. Formats correspond to a target data type.
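
Python's standard library has no parser for CLDR-style patterns, so a consumer written in Python would need to translate them. The sketch below converts a small, hand-picked subset of pattern tokens into strptime directives; the token table and function names are assumptions and cover far less than the full pattern syntax.

```python
import re
from datetime import datetime

# Assumed subset of pattern tokens and their strptime equivalents.
CLDR_TO_STRPTIME = {
    "yyyy": "%Y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
}


def cldr_to_strptime(pattern: str) -> str:
    """Translate each run of letters in the pattern; reject unknown tokens."""
    def convert(match: "re.Match[str]") -> str:
        token = match.group(0)
        if token not in CLDR_TO_STRPTIME:
            raise ValueError(f"unsupported pattern token: {token}")
        return CLDR_TO_STRPTIME[token]

    return re.sub(r"[A-Za-z]+", convert, pattern)


def parse_date(value: str, pattern: str) -> datetime:
    """Parse a string using a pattern such as yyyy/MM/dd."""
    return datetime.strptime(value, cldr_to_strptime(pattern))


print(parse_date("2022/11/10", "yyyy/MM/dd"))
```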

Supported Format Types

Data types Format Example
sc:Date, sc:DateTime CLDR Date/Time Patterns MM/dd/yyyy
sc:Number, sc:Float, sc:Integer CLDR Number and Currency patterns 0.##E0 (scientific notation with max 2 decimals)
cr:BoundingBox Keras bounding box format CENTER_XYWH

Note: This list is not exhaustive, and not all Croissant implementations will support all formats.

Examples

Date Format Parsing

{
  "source": {
    "fileObject": { "@id": "metadata" },
    "extract": { "column": "datetaken" },
    "format": "%Y-%m-%d %H:%M:%S.%f"
  }
}

Bounding Box Format

{
  "@type": "cr:Field",
  "@id": "images/annotations/bbox",
  "description": "The bounding box around annotated object[s].",
  "dataType": "cr:BoundingBox",
  "source": {
    "fileSet": { "@id": "instancesperson_keypoints_annotations" },
    "extract": { "column": "bbox" },
    "format": "CENTER_XYWH"
  }
}
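
Knowing the bounding box format matters because the same four numbers mean different things in different conventions. The sketch below converts a box to corner coordinates, assuming CENTER_XYWH means [x_center, y_center, width, height]; the function name and sample box are hypothetical.

```python
from typing import Sequence, Tuple


def center_xywh_to_corners(box: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert [x_center, y_center, width, height] to (left, top, right, bottom)."""
    x_center, y_center, width, height = box
    half_width, half_height = width / 2, height / 2
    return (
        x_center - half_width,
        y_center - half_height,
        x_center + half_width,
        y_center + half_height,
    )


print(center_xywh_to_corners([50.0, 40.0, 20.0, 10.0]))
```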

Usage

Format specifications are typically used within DataSource definitions to ensure that string representations of structured data (like dates, numbers, or coordinates) are correctly parsed into their intended data types. This is particularly important for ML datasets where precise data interpretation is crucial for model training and evaluation.


RecordSet

URI: http://mlcommons.org/croissant/RecordSet

Description: A description of a set of structured records from one or more data sources and their structure, expressed as a set of fields.

Subclass of:

Properties:

Documentation:

A RecordSet describes a set of structured records obtained from one or more data sources (typically a file or set of files) and the structure of these records, expressed as a set of fields (e.g., the columns of a table). A RecordSet can represent flat or nested data.

Purpose

RecordSet provides a common structure description that can be used across different modalities, in terms of records that may contain multiple fields. It handles:

RecordSet is a subclass of schema.org/Intangible.

Properties

Property Expected Type Cardinality Description
field Field MANY A data element that appears in the records of the RecordSet (e.g., one column of a table)
key Text MANY One or more fields whose values uniquely identify each record in the RecordSet
data JSON MANY One or more records that constitute the data of the RecordSet
examples JSON or URL MANY One or more records provided as example content of the RecordSet, or a reference to a data source that contains examples

Additional Features

Examples

Simple Tabular RecordSet

{
  "@type": "cr:RecordSet",
  "@id": "ratings",
  "key": [{ "@id": "ratings/user_id" }, { "@id": "ratings/movie_id" }],
  "field": [
    {
      "@type": "cr:Field",
      "@id": "ratings/user_id",
      "dataType": "sc:Integer",
      "source": {
        "fileObject": { "@id": "ratings-table" },
        "extract": { "column": "userId" }
      }
    },
    {
      "@type": "cr:Field",
      "@id": "ratings/rating",
      "description": "The score of the rating on a five-star scale.",
      "dataType": "sc:Float",
      "source": {
        "fileObject": { "@id": "ratings-table" },
        "extract": { "column": "rating" }
      }
    }
  ]
}
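
Since key names the fields whose values uniquely identify each record, a validator could check loaded records for violations. A minimal sketch follows, assuming records are dictionaries keyed by field id; the helper name and sample rows are hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple


def find_duplicate_keys(
    records: List[Dict[str, Any]], key_fields: Sequence[str]
) -> List[Tuple[Any, ...]]:
    """Return every composite key value that appears more than once."""
    seen = set()
    duplicates = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# Hypothetical rows; the first and third share the same composite key.
rows = [
    {"ratings/user_id": 1, "ratings/movie_id": 10, "ratings/rating": 4.0},
    {"ratings/user_id": 1, "ratings/movie_id": 11, "ratings/rating": 3.0},
    {"ratings/user_id": 1, "ratings/movie_id": 10, "ratings/rating": 5.0},
]
print(find_duplicate_keys(rows, ["ratings/user_id", "ratings/movie_id"]))
```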

Enumeration with Embedded Data

{
  "@type": "cr:RecordSet",
  "@id": "gender_enum",
  "description": "Maps gender ids (0, 1) to labeled values.",
  "key": { "@id": "gender_enum/id" },
  "field": [
    { "@id": "gender_enum/id", "@type": "cr:Field", "dataType": "sc:Integer" },
    { "@id": "gender_enum/label", "@type": "cr:Field", "dataType": "sc:Text" }
  ],
  "data": [
    { "gender_enum/id": 0, "gender_enum/label": "Male" },
    { "gender_enum/id": 1, "gender_enum/label": "Female" }
  ]
}
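
Inlined data of this kind is easy to turn into a lookup table. The sketch below parses an abbreviated copy of the RecordSet above and builds an id-to-label dictionary; the variable names are hypothetical.

```python
import json

# Abbreviated copy of the RecordSet above, keeping only what the lookup needs.
record_set = json.loads("""
{
  "@id": "gender_enum",
  "key": { "@id": "gender_enum/id" },
  "data": [
    { "gender_enum/id": 0, "gender_enum/label": "Male" },
    { "gender_enum/id": 1, "gender_enum/label": "Female" }
  ]
}
""")

key_field = record_set["key"]["@id"]
labels = {row[key_field]: row["gender_enum/label"] for row in record_set["data"]}
print(labels)
```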

Geographic Data with Type Mapping

{
  "@id": "cities",
  "@type": "cr:RecordSet",
  "dataType": "sc:GeoCoordinates",
  "field": [
    {
      "@id": "cities/latitude",
      "@type": "cr:Field"
    },
    {
      "@id": "cities/longitude", 
      "@type": "cr:Field"
    }
  ]
}

This example shows how RecordSets can be typed with semantic types like sc:GeoCoordinates, and fields can be implicitly mapped to properties of that type (latitude and longitude).


Transform

URI: http://mlcommons.org/croissant/Transform

Description: Specifies how to transform data extracted from a DataSource. The type of transformation depends on the type of content, e.g., a regular expression to apply on text, or a jsonQuery to transform JSON content.

Subclass of:

Properties:

Documentation:

Croissant supports a few simple transformations that can be applied on the source data. Transformations are used to modify extracted data before it’s included in the final dataset.

Supported Transformations

Examples

Regular Expression Transformation

{
  "fileSet": {
    "@id": "files"
  },
  "extract": {
    "fileProperty": "filename"
  },
  "transform": {
    "regex": "^(train|val|test)2014/.*\\.jpg$"
  }
}

This example extracts filenames and applies a regex to parse training/validation/test split information.
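
For illustration, the same regular expression can be tried out in Python. The sketch assumes the transformed value is the first capture group and that the value being matched includes the directory prefix; the helper name and sample paths are hypothetical.

```python
import re
from typing import Optional

# The "regex" value from the example above, after JSON unescaping.
SPLIT_REGEX = re.compile(r"^(train|val|test)2014/.*\.jpg$")


def split_of(path: str) -> Optional[str]:
    """Return the split name captured by the regex, or None if there is no match."""
    match = SPLIT_REGEX.match(path)
    return match.group(1) if match else None


print(split_of("val2014/image_000042.jpg"))
```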

Filename Parsing

{
  "source": {
    "fileSet": { "@id": "image-files" },
    "extract": {
      "fileProperty": "filename"
    },
    "transform": {
      "regex": "([^\\/]*)\\.jpg"
    }
  }
}

This extracts the base filename (without path and extension) from image files.

Delimiter Transformation

{
  "transform": {
    "delimiter": ","
  }
}

This would split a comma-separated string into an array of values.
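
In Python terms this corresponds roughly to str.split. The sketch below is a hypothetical helper that treats the delimiter as a literal string and does not strip surrounding whitespace; both choices are assumptions.

```python
from typing import List


def apply_delimiter(value: str, delimiter: str) -> List[str]:
    """Split a raw string into a list of values on a literal delimiter."""
    return value.split(delimiter)


print(apply_delimiter("red,green,blue", ","))
```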

JSON Query Transformation

{
  "transform": {
    "jsonQuery": "$.metadata.authors[*].name"
  }
}

This would extract all author names from a JSON structure using a JSON query.

Usage

Transformations are typically used within DataSource definitions, applied after data extraction but before final formatting. They provide a way to clean, parse, or restructure data to make it suitable for machine learning workflows without requiring external preprocessing steps.
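
The ordering described above (extract, then transform, then format) can be sketched as a small pipeline. This is an illustration under several assumptions: only column extraction and a single regex transform are handled, the format is treated as a Python strptime pattern as in the Date Format Parsing example, and the source definition and sample row are hypothetical.

```python
import re
from datetime import datetime
from typing import Any, Dict


def apply_source(source: Dict[str, Any], row: Dict[str, str]) -> Any:
    """Run extract, transform, and format in that order on one row."""
    # 1. Extract: pick the named column from the row.
    value = row[source["extract"]["column"]]

    # 2. Transform: keep the first capture group of the regex, if one is given.
    regex = source.get("transform", {}).get("regex")
    if regex is not None:
        match = re.search(regex, value)
        if match is None:
            raise ValueError(f"value does not match {regex!r}: {value!r}")
        value = match.group(1)

    # 3. Format: parse the remaining text into a typed value.
    date_format = source.get("format")
    if date_format is not None:
        return datetime.strptime(value, date_format)
    return value


source = {
    "extract": {"column": "datetaken"},
    "transform": {"regex": "^taken: (.*)$"},
    "format": "%Y-%m-%d %H:%M:%S.%f",
}
row = {"datetaken": "taken: 2013-02-01 10:30:00.000000"}
print(apply_source(source, row))
```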


Properties (29)

citeAs

URI: http://mlcommons.org/croissant/citeAs

Description: How to cite this dataset. Ideally, citations should be expressed using the bibtex format. Note that this is different from schema.org/citation, which is used to make a citation to another publication from this dataset.

Domain:

Range:


column

URI: http://mlcommons.org/croissant/column

Description: In case the data source is tabular, the id of a column to extract.

Domain:

Range:


containedIn

URI: http://mlcommons.org/croissant/containedIn

Description: Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object.

Domain:

Range:


content

URI: http://mlcommons.org/croissant/content

Description: What to extract from the data source content, e.g., lines.

Domain:

Range:


data

URI: http://mlcommons.org/croissant/data

Description: One or more inlined records that constitute the data of the RecordSet, typically used for small enumeration values.

Domain:

Range:


dataType

URI: http://mlcommons.org/croissant/dataType

Description: The data type of the field, identified by the URI of the corresponding class. It could be either an atomic type (e.g., sc:Integer) or a semantic type (e.g., sc:GeoLocation).

Domain:

Range:


delimiter

URI: http://mlcommons.org/croissant/delimiter

Description: A delimiter used to parse the data into an array.

Domain:

Range:


equivalentProperty

URI: http://mlcommons.org/croissant/equivalentProperty

Description: A property that is equivalent to this Field. Used in the case a dataType is specified on the RecordSet to map specific fields to specific properties associated with that dataType.

Domain:

Range:


examples

URI: http://mlcommons.org/croissant/examples

Description: One or more inlined records provided as example content of the RecordSet.

Domain:

Range:


excludes

URI: http://mlcommons.org/croissant/excludes

Description: A glob pattern that specifies the files to exclude. The pattern is evaluated from the root of the containedIn contents, after the includes patterns have been evaluated.

Domain:

Range:


extract

URI: http://mlcommons.org/croissant/extract

Description: The extraction method from the provided source.

Domain:

Range:


field

URI: http://mlcommons.org/croissant/field

Description: A data element that appears in the records of the RecordSet (e.g., one column of a table).

Domain:

Range:


fileObject

URI: http://mlcommons.org/croissant/fileObject

Description: The id of a FileObject that is the source of the data.

Domain:

Range:


fileProperty

URI: http://mlcommons.org/croissant/fileProperty

Description: The file property to extract from the data source metadata, e.g., the filename.

Domain:

Range:


fileSet

URI: http://mlcommons.org/croissant/fileSet

Description: The id of a FileSet that is the source of the data.

Domain:

Range:


format

URI: http://mlcommons.org/croissant/format

Description: A format to parse the values of the data from text, e.g., a date format or number format.

Domain:

Range:


includes

URI: http://mlcommons.org/croissant/includes

Description: A glob pattern that specifies the files to include, e.g., “*.jpg”, “*/foo/pic*.jpg”. The pattern is evaluated from the root of the containedIn contents.

Domain:

Range:


isLiveDataset

URI: http://mlcommons.org/croissant/isLiveDataset

Description: Indicates that the dataset is continuously updated instead of being versioned.

Domain:

Range:


jsonPath

URI: http://mlcommons.org/croissant/jsonPath

Description: In case the data source is JSON data, a path expression to extract a subset of the data.

Domain:

Range:


jsonQuery

URI: http://mlcommons.org/croissant/jsonQuery

Description: For JSON content, a query to evaluate on the data.

Domain:

Range:


key

URI: http://mlcommons.org/croissant/key

Description: One or more fields whose values uniquely identify each record in the RecordSet.

Domain:

Range:


parentField

URI: http://mlcommons.org/croissant/parentField

Description: A special case of SubField that should be hidden because it references a Field that already appears in the RecordSet.

Domain:

Range:


recordSet

URI: http://mlcommons.org/croissant/recordSet

Description: The id of a RecordSet that is the source of the data.

Domain:

Range:


references

URI: http://mlcommons.org/croissant/references

Description: Another Field of another RecordSet that this field references. This is the equivalent of a foreign key reference in a relational database.

Domain:

Range:


regex

URI: http://mlcommons.org/croissant/regex

Description: A regular expression to apply to the data.

Domain:

Range:


repeated

URI: http://mlcommons.org/croissant/repeated

Description: If true, then the Field is a list of values of type dataType.

Domain:

Range:


source

URI: http://mlcommons.org/croissant/source

Description: The data source of the field. This will generally reference a FileObject or FileSet’s contents (e.g., a specific column of a table).

Domain:

Range:


subField

URI: http://mlcommons.org/croissant/subField

Description: Another Field that is nested inside this one.

Domain:

Range:


transform

URI: http://mlcommons.org/croissant/transform

Description: A transformation to apply to the source data after the extraction specified through extract, e.g., a regular expression or JSON query.

Domain:

Range:



🐢 Generated with turtle-matter v0.1.0