Data
- class glue.core.data.Data(label='', coords=None, **kwargs)
Bases: BaseCartesianData
The basic data container in Glue.
The data object stores data as a collection of Component objects. Each component stored in a dataset must have the same shape.
Catalog data sets are stored such that each column is a distinct 1-dimensional Component.
There are several ways to extract the actual numerical data stored in a Data object:
    data = Data(x=[1, 2, 3], label='data')
    xid = data.id['x']
    data[xid]
    data.get_component(xid).data
    data['x']  # if 'x' is a unique component name
Likewise, datasets support fancy indexing:
    data[xid, 0:2]
    data[xid, [True, False, True]]
- Parameters:
- label : str
The name of the dataset.
- coords : Coordinates
The coordinates object to use to define world coordinates.
Attributes Summary
- All ComponentIDs in the Data.
- The ComponentIDs associated with a CoordinateComponent.
- A list of the ComponentLinks that connect pixel and world.
- The coordinates object for the data.
- The ComponentIDs for each DerivedComponent.
- A list of the links present inside all of the DerivedComponent objects in this dataset.
- The name of the dataset.
- A list of all the links internal to the dataset.
- The number of dimensions of the data, as an integer.
- The ComponentIDs for each pixel coordinate.
- The ComponentIDs not associated with a DerivedComponent.
- The n-dimensional shape of the dataset, as a tuple.
- The size of the data (the product of the shape dimensions), as an integer.
- All ComponentIDs in the Data that aren't coordinates.
- The ComponentIDs for each world coordinate.
Methods Summary
- add_component(component, label): Add a new component to this data set.
- add_component_link(link[, label]): Shortcut method for generating a new DerivedComponent from a ComponentLink object, and adding it to a data set.
- component_ids(): Equivalent to Data.components.
- compute_fixed_resolution_buffer(*args, **kwargs): Get a fixed-resolution buffer.
- compute_histogram(cids[, weights, range, ...]): Compute an n-dimensional histogram with regularly spaced bins.
- compute_statistic(statistic, cid[, ...]): Compute a statistic for the data.
- dtype(cid): Look up the dtype for the data associated with a ComponentID.
- find_component_id(label): Retrieve component_ids associated by label name.
- get_component(component_id): Fetch the component corresponding to component_id.
- get_data(cid[, view]): Get the data values for a given component.
- get_kind(cid): Get the kind of data for a given component.
- get_mask(subset_state[, view]): Get a boolean mask for a given subset state.
- join_on_key(other, cid, cid_other): Create an element mapping to another dataset, by joining on values of ComponentIDs in both datasets.
- remove_component(component_id): Remove a component from a data set.
- reorder_components(component_ids): Reorder the components using a list of component IDs.
- to_dataframe([index]): Convert the Data object into a pandas.DataFrame object.
- update_components(mapping): Change the numerical data associated with some of the Components in this Data object.
- update_id(old, new): Reassign a component to a different glue.core.component_id.ComponentID.
- update_values_from_data(data): Replace numerical values in data to match values from another dataset.
Attributes Documentation
- components
All ComponentIDs in the Data.
- Returns:
- list
- coordinate_components
The ComponentIDs associated with a CoordinateComponent.
- Returns:
- list
- coordinate_links
A list of the ComponentLinks that connect pixel and world. If no coordinate transformation object is present, return an empty list.
- derived_components
The ComponentIDs for each DerivedComponent.
- Returns:
- list
- derived_links
A list of the links present inside all of the DerivedComponent objects in this dataset.
- pixel_component_ids
The ComponentIDs for each pixel coordinate.
- primary_components
The ComponentIDs not associated with a DerivedComponent.
This property is deprecated.
- visible_components
All ComponentIDs in the Data that aren’t coordinates.
This property is deprecated.
- world_component_ids
The ComponentIDs for each world coordinate.
Methods Documentation
- add_component(component, label)
Add a new component to this data set.
- Parameters:
- component : Component or array-like
Object to add.
- label : str or ComponentID
The label. If this is a string, a new glue.core.component_id.ComponentID with this label will be created and associated with the Component.
- Returns:
- glue.core.component_id.ComponentID
The ComponentID associated with the newly-added component.
- Raises:
- TypeError, if label is invalid.
- ValueError, if the component has an incompatible shape.
- add_component_link(link, label=None)
Shortcut method for generating a new DerivedComponent from a ComponentLink object, and adding it to a data set.
- Parameters:
- link : ComponentLink
The link to use to generate a new component.
- label : ComponentID or str
The ComponentID or label to attach to.
- Returns:
- component : DerivedComponent
The component that was added.
- component_ids()
Equivalent to Data.components.
- compute_fixed_resolution_buffer(*args, **kwargs)
Get a fixed-resolution buffer.
- Parameters:
- bounds : list
The list of bounds for the fixed resolution buffer. This list should have as many items as there are dimensions in target_data. Each item should either be a scalar value, or a tuple of (min, max, nsteps).
- target_data : Data, optional
The data in whose frame of reference the bounds are defined. Defaults to data.
- target_cid : ComponentID, optional
If specified, the ID of the component to use for the data values. Alternatively, use subset_state to get a subset mask.
- subset_state : SubsetState, optional
If specified, the subset state for which to compute a mask. Alternatively, use target_cid if you want to get data values.
- broadcast : bool, optional
If True, and a dimension in target_data for which bounds is not a scalar does not affect any of the dimensions in data, the final array will be effectively broadcast along this dimension; otherwise an error will be raised.
- compute_histogram(cids, weights=None, range=None, bins=None, log=None, subset_state=None, random_subset=None)
Compute an n-dimensional histogram with regularly spaced bins.
Currently this only implements 1-D histograms.
- Parameters:
- cids : list of str or ComponentID
Component IDs to compute the histogram over.
- weights : str or ComponentID
Component ID to use for the histogram weights.
- range : list of tuple
The (min, max) of the histogram range.
- bins : list of int
The number of bins.
- log : list of bool
Whether to compute the histogram in log space.
- subset_state : SubsetState, optional
If specified, the histogram will only take into account values in the subset state.
- random_subset : int, optional
If specified, this should be an integer giving the number of values to use for the statistic.
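To make the range, bins, and log parameters concrete, the following is a minimal pure-Python sketch of a 1-D histogram with regularly spaced bins. It is not glue's implementation: the function name histogram_1d is hypothetical, and counting values that sit exactly on the upper edge in the last bin is an assumption made for this illustration.

```python
import math


def histogram_1d(values, value_range, bins, log=False):
    """Count values in `bins` equal-width bins spanning `value_range`."""
    lo, hi = value_range
    if log:
        # Bin in log10 space; non-positive values have no logarithm and are skipped.
        values = [math.log10(v) for v in values if v > 0]
        lo, hi = math.log10(lo), math.log10(hi)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # Assumption: a value equal to the upper edge falls in the last bin.
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    return counts


linear_counts = histogram_1d([0.5, 1.5, 1.7, 2.5, 3.0], (0.0, 3.0), 3)
log_counts = histogram_1d([2, 20, 200], (1, 1000), 3, log=True)
print(linear_counts)  # [1, 2, 2]
print(log_counts)     # [1, 1, 1]
```

With log=True the three values, which span two orders of magnitude, land in three separate bins, which is the practical reason to bin in log space.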
- compute_statistic(statistic, cid, subset_state=None, axis=None, finite=True, positive=False, percentile=None, view=None, random_subset=None, n_chunk_max=40000000)
Compute a statistic for the data.
- Parameters:
- statistic : {‘minimum’, ‘maximum’, ‘mean’, ‘median’, ‘sum’, ‘percentile’}
The statistic to compute.
- cid : ComponentID or str
The component ID to compute the statistic on. If given as a string, this will be assumed to be for the component belonging to the dataset (not external links).
- subset_state : SubsetState, optional
If specified, the statistic will only include the values that are in the subset specified by this subset state.
- axis : int or tuple of int, optional
If specified, the axis/axes to compute the statistic over.
- finite : bool, optional
Whether to include only finite values in the statistic. This should be True to ignore NaN/Inf values.
- positive : bool, optional
Whether to include only (strictly) positive values in the statistic. This is used for example when computing statistics of data shown in log space.
- percentile : float, optional
If statistic is 'percentile', the percentile argument should be given and specify the percentile to calculate in the range [0:100].
- random_subset : int, optional
If specified, this should be an integer giving the number of values to use for the statistic. This can only be used if axis is None.
- n_chunk_max : int, optional
If there are more elements in the array than this value, operate in chunks with at most this size.
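The effect of finite, positive, and n_chunk_max can be illustrated with a small pure-Python sketch of a chunked minimum. This is only an illustration of the documented behaviour, not glue's code: keep_value and chunked_minimum are hypothetical names, and the tiny default chunk size is chosen so that the example actually splits the input.

```python
import math


def keep_value(value, finite=True, positive=False):
    """Decide whether a value takes part in the statistic."""
    if finite and not math.isfinite(value):
        return False  # drops NaN and +/-Inf
    if positive and not value > 0:
        return False  # drops zero and negative values
    return True


def chunked_minimum(values, finite=True, positive=False, n_chunk_max=4):
    """Minimum over the kept values, visiting at most n_chunk_max items at a time."""
    result = None
    for start in range(0, len(values), n_chunk_max):
        chunk = [v for v in values[start:start + n_chunk_max]
                 if keep_value(v, finite, positive)]
        if chunk:
            chunk_min = min(chunk)
            result = chunk_min if result is None else min(result, chunk_min)
    return result


values = [3.0, float("nan"), -1.0, float("inf"), 0.0, 2.0]
print(chunked_minimum(values))                 # -1.0
print(chunked_minimum(values, positive=True))  # 2.0
```

Combining per-chunk results works directly for minimum, maximum, and sum; statistics such as the median need more care, which this sketch does not attempt.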
- find_component_id(label)
Retrieve component_ids associated by label name.
- Parameters:
- label : ComponentID or str
ComponentID to search for.
- Returns:
- ComponentID or None
The associated ComponentID if label is found and unique, else None. First, this checks whether the component ID is present and unique in the primary (non-derived) components of the data, and if not then the derived components are checked. If there is one instance of the label in the primary and one in the derived components, the primary one takes precedence.
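The lookup order described above can be sketched in plain Python. The function find_unique_label and its list-of-labels inputs are hypothetical stand-ins for the real component collections; the sketch only mirrors the precedence rule stated in the text.

```python
def find_unique_label(label, primary_labels, derived_labels):
    """Return (group, index) for a uniquely matching label, or None."""
    # Primary components are checked first, so they win over derived ones.
    for group, labels in (("primary", primary_labels), ("derived", derived_labels)):
        matches = [i for i, candidate in enumerate(labels) if candidate == label]
        if len(matches) == 1:
            return group, matches[0]
    return None


print(find_unique_label("x", ["x", "y"], ["x"]))  # ('primary', 0)
print(find_unique_label("z", ["x", "y"], ["z"]))  # ('derived', 0)
print(find_unique_label("y", ["y", "y"], []))     # None
```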
- get_component(component_id)
Fetch the component corresponding to component_id.
- Parameters:
- component_id : ComponentID
The ID for the component to retrieve.
- get_data(cid, view=None)
Get the data values for a given component.
- Parameters:
- cid : ComponentID
The component ID to get the data for.
- view : slice
The ‘view’ on the data: anything that is considered a valid Numpy slice/index.
- get_kind(cid)
Get the kind of data for a given component.
- Parameters:
- cid : ComponentID
The component ID to get the data kind for.
- Returns:
- kind : {‘numerical’, ‘categorical’, ‘datetime’}
The kind of data for the given component ID.
- get_mask(subset_state, view=None)
Get a boolean mask for a given subset state.
- Parameters:
- subset_state : SubsetState
The subset state to use to compute the mask.
- view : slice
The ‘view’ on the mask: anything that is considered a valid Numpy slice/index.
- join_on_key(other, cid, cid_other)
Create an element mapping to another dataset, by joining on values of ComponentIDs in both datasets.
This join allows any subsets defined on other to be propagated to self. The different ways to call this method are described in the Examples section below.
- Parameters:
- other : Data
Data object to join with.
- cid : str or ComponentID or iterable thereof
Component(s) in this dataset to use as a key.
- cid_other : str or ComponentID or iterable thereof
Component(s) in the other dataset to use as a key.
Examples
There are several ways to use this function, depending on how many components are passed to cid and cid_other.
Joining on single components
First, one can specify a single component ID for both cid and cid_other: this is the standard mode, and joins one component from one dataset to the other:
>>> d1 = Data(x=[1, 2, 3, 4, 5], k1=[0, 0, 1, 1, 2], label='d1')
>>> d2 = Data(y=[2, 4, 5, 8, 4], k2=[1, 3, 1, 2, 3], label='d2')
>>> d2.join_on_key(d1, 'k2', 'k1')
Selecting all values in d1 where x is greater than 2 returns the last three items as expected:
>>> s = d1.new_subset()
>>> s.subset_state = d1.id['x'] > 2
>>> s.to_mask()
array([False, False, True, True, True], dtype=bool)
The linking was done between k1 and k2, and the values of k1 for the last three items are 1 and 2. This means that the first, third, and fourth item in d2 will then get selected, since k2 has a value of either 1 or 2 for these items.
>>> s = d2.new_subset()
>>> s.subset_state = d1.id['x'] > 2
>>> s.to_mask()
array([ True, False, True, True, False], dtype=bool)
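The propagation rule for a single-key join can be reproduced without glue. The sketch below uses plain lists in place of Data objects; propagate_single_key is a hypothetical helper written for this illustration and says nothing about how glue implements the join internally.

```python
def propagate_single_key(selected, keys_src, keys_dst):
    """Carry a boolean selection across two datasets joined on one key each."""
    # Collect the key values of every selected item in the source dataset.
    selected_keys = {key for key, is_selected in zip(keys_src, selected) if is_selected}
    # An item in the destination is selected when its key is among them.
    return [key in selected_keys for key in keys_dst]


x = [1, 2, 3, 4, 5]
k1 = [0, 0, 1, 1, 2]   # key column of d1
k2 = [1, 3, 1, 2, 3]   # key column of d2

mask_d1 = [value > 2 for value in x]
mask_d2 = propagate_single_key(mask_d1, k1, k2)
print(mask_d1)  # [False, False, True, True, True]
print(mask_d2)  # [True, False, True, True, False]
```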
Joining on multiple components
Note
This mode is currently slow, and will be optimized significantly in future.
Next, one can specify several components for each dataset: in this case, the number of components given should match for both datasets. This causes items in both datasets to be linked when (and only when) the set of keys match between the two datasets:
>>> d1 = Data(x=[1, 2, 3, 5, 5],
...           y=[0, 0, 1, 1, 2], label='d1')
>>> d2 = Data(a=[2, 5, 5, 8, 4],
...           b=[1, 3, 2, 2, 3], label='d2')
>>> d2.join_on_key(d1, ('a', 'b'), ('x', 'y'))
Selecting all items in d1 where x is 5 works as expected and selects the last two items:
>>> s = d1.new_subset()
>>> s.subset_state = d1.id['x'] == 5
>>> s.to_mask()
array([False, False, False, True, True], dtype=bool)
If we apply this selection to d2, only items where a is 5 and b is 2 will be selected:
>>> s = d2.new_subset()
>>> s.subset_state = d1.id['x'] == 5
>>> s.to_mask()
array([False, False, True, False, False], dtype=bool)
In particular, the second item (where a is 5 and b is 3) is not selected.
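For the multi-component case, the matching rule amounts to comparing whole key tuples. The following plain-Python sketch mirrors the example above; propagate_multi_key is a hypothetical name, and lists stand in for the components.

```python
def propagate_multi_key(selected, key_cols_src, key_cols_dst):
    """Carry a selection across datasets joined on several key columns."""
    # Each item is identified by the tuple of its values in all key columns.
    selected_keys = {
        key for key, is_selected in zip(zip(*key_cols_src), selected) if is_selected
    }
    # A destination item matches only if its complete tuple was selected.
    return [key in selected_keys for key in zip(*key_cols_dst)]


x = [1, 2, 3, 5, 5]
y = [0, 0, 1, 1, 2]
a = [2, 5, 5, 8, 4]
b = [1, 3, 2, 2, 3]

mask_d1 = [value == 5 for value in x]
mask_d2 = propagate_multi_key(mask_d1, (x, y), (a, b))
print(mask_d2)  # [False, False, True, False, False]
```

The selected tuples are (5, 1) and (5, 2); only the third item of d2 has (a, b) equal to one of them, so the item with (5, 3) stays unselected.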
One-to-many and many-to-one joining
Finally, you can specify one component in one dataset and multiple ones in the other. In the case where one component is specified for this dataset and multiple ones for the other dataset, then when an item is selected in the other dataset, it will cause any item in the present dataset which matches any of the keys in the other data to be selected:
>>> d1 = Data(x=[1, 2, 3], label='d1')
>>> d2 = Data(a=[1, 1, 2],
...           b=[2, 3, 3], label='d2')
>>> d1.join_on_key(d2, 'x', ('a', 'b'))
In this case, if we select all items in d2 where a is 2, this will select the third item:
>>> s = d2.new_subset()
>>> s.subset_state = d2.id['a'] == 2
>>> s.to_mask()
array([False, False, True], dtype=bool)
Since we have joined the datasets using both a and b, we select all items in d1 where x is either the value of a or b (2 or 3), which means we select the second and third item:
>>> s = d1.new_subset()
>>> s.subset_state = d2.id['a'] == 2
>>> s.to_mask()
array([False, True, True], dtype=bool)
We can also join the datasets the other way around:
>>> d1 = Data(x=[1, 2, 3], label='d1')
>>> d2 = Data(a=[1, 1, 2],
...           b=[2, 3, 3], label='d2')
>>> d2.join_on_key(d1, ('a', 'b'), 'x')
In this case, selecting items in d1 where x is 1 selects the first item, as expected:
>>> s = d1.new_subset()
>>> s.subset_state = d1.id['x'] == 1
>>> s.to_mask()
array([ True, False, False], dtype=bool)
This then causes any item in d2 where either a or b is 1 to be selected, i.e. the first two items:
>>> s = d2.new_subset()
>>> s.subset_state = d1.id['x'] == 1
>>> s.to_mask()
array([ True, True, False], dtype=bool)
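Both directions of the one-to-many join reduce to an "any key matches" test. The sketch below reproduces the two masks above with plain lists; the helper names are hypothetical and the code only illustrates the rule, not glue's internals.

```python
def propagate_many_to_one(selected, key_cols_src, key_dst):
    """Source has several key columns, destination has a single one."""
    selected_keys = set()
    for row, is_selected in zip(zip(*key_cols_src), selected):
        if is_selected:
            selected_keys.update(row)  # every key of a selected row counts
    return [key in selected_keys for key in key_dst]


def propagate_one_to_many(selected, key_src, key_cols_dst):
    """Source has a single key column, destination has several."""
    selected_keys = {key for key, is_selected in zip(key_src, selected) if is_selected}
    # A destination row is selected if any of its keys was selected.
    return [any(key in selected_keys for key in row) for row in zip(*key_cols_dst)]


x = [1, 2, 3]
a = [1, 1, 2]
b = [2, 3, 3]

# Selection made in d2 (a == 2), propagated to d1.
mask_d1 = propagate_many_to_one([value == 2 for value in a], (a, b), x)
# Selection made in d1 (x == 1), propagated to d2.
mask_d2 = propagate_one_to_many([value == 1 for value in x], x, (a, b))
print(mask_d1)  # [False, True, True]
print(mask_d2)  # [True, True, False]
```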
- remove_component(component_id)
Remove a component from a data set.
- Parameters:
- component_id : ComponentID
The component to remove.
- reorder_components(component_ids)
Reorder the components using a list of component IDs. The new set of component IDs has to match the existing set (though order may differ).
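The constraint that the new IDs must match the existing set can be pictured as a simple check before reordering. This is a hedged sketch with plain strings in place of ComponentID objects; the function name and the choice of ValueError are assumptions, not taken from glue.

```python
def reorder(existing, requested):
    """Return the requested order if it is a permutation of the existing IDs."""
    # Assumption: a mismatch is reported with ValueError.
    if len(requested) != len(existing) or set(requested) != set(existing):
        raise ValueError("requested IDs must match the existing set of component IDs")
    return list(requested)


new_order = reorder(["x", "y", "z"], ["z", "x", "y"])
print(new_order)  # ['z', 'x', 'y']

try:
    reorder(["x", "y", "z"], ["x", "y"])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```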
- to_dataframe(index=None)
Convert the Data object into a pandas.DataFrame object.
- Parameters:
- index : index-like
Any object that can be passed to the pandas.Series constructor.
- Returns:
- pandas.DataFrame
- update_components(mapping)
Change the numerical data associated with some of the Components in this Data object.
All changes to component numerical data should use this method, which broadcasts the state change to the appropriate places.
- Parameters:
- mapping : dict
A dictionary mapping Components or ComponentIDs to arrays.
- This method has the following restrictions:
- New components must have the same shape as old components.
- Component subclasses cannot be updated.
- update_id(old, new)
Reassign a component to a different glue.core.component_id.ComponentID.
- Parameters:
- old : ComponentID
The old component ID.
- new : ComponentID
The new component ID.
- update_values_from_data(data)
Replace numerical values in data to match values from another dataset.
Notes
This method drops components that aren’t present in the new data, and adds components that are in the new data that were not in the original data. The matching is done by component label, and components are resized if needed. This means that for components with matching labels in the original and new data, the ComponentIDs are preserved, and existing plots and selections will be updated to reflect the new values. Note that the coordinates are also copied, but the style is not copied.