aweSOM package

Submodules

aweSOM.make_sce_clusters module

aweSOM.make_sce_clusters.combine_separated_clusters(map_list: list, cluster_ranges: list[list[int]], dims: int, file_path: str) → numpy.ndarray

Combine separated clusters by summing their corresponding gsum masks.

Parameters:
  • map_list (list) – A list of instances representing the binary maps.

  • cluster_ranges (list[list[int]]) – A list of ranges indicating the start and end indices for each cluster.

  • dims (int) – The dimensions of the binary maps.

  • file_path (str) – The file path where the binary maps are stored.

Returns:

A numpy array containing the summed binary maps for each cluster.

Return type:

np.ndarray
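
The summation described above can be sketched with in-memory arrays instead of maps read from file_path. The helper name `sum_maps_by_range` is hypothetical, and treating each range as half-open ([start, end)) is an assumption; the documentation does not say whether the end index is inclusive.

```python
import numpy as np

def sum_maps_by_range(binary_maps, cluster_ranges):
    """Sum the binary maps whose indices fall in each [start, end) range.

    Hypothetical helper: the documented function works with maps stored on
    disk, and the inclusiveness of the end index is an assumption here.
    """
    return np.stack([
        np.sum(binary_maps[start:end], axis=0) for start, end in cluster_ranges
    ])

# Four 2 x 2 binary maps, grouped into two clusters of two maps each.
maps = np.array([
    [[1, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[0, 0], [1, 0]],
    [[0, 0], [1, 1]],
])
summed = sum_maps_by_range(maps, [[0, 2], [2, 4]])  # one summed map per cluster
```

Each entry of `summed` counts how many maps in that cluster mark a given cell.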

aweSOM.make_sce_clusters.get_gsum_values(mapping_file: str)

Get the gsum values from the mapping file.

Parameters:

mapping_file (str) – path to the mapping file

Returns:

gsum values dict: mapping of gsum values to cluster id and cluster name

Return type:

list

aweSOM.make_sce_clusters.get_sce_cluster_separation(gsum_deriv: numpy.ndarray, threshold: float)

Identify the separation of clusters in a given derivative array based on a specified threshold.

Parameters:
  • gsum_deriv (np.ndarray) – A 1D array representing the derivative values.

  • threshold (float) – The threshold value used to determine cluster separation.

Returns:

A tuple containing:
  • list: A list of ranges for the identified clusters, where each range is represented as a list of two integers.

  • list: A list of indices representing the local minima found below the threshold.

Return type:

tuple
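
One way to picture the separation step is the sketch below. It is not the library's implementation: the helper name `split_at_minima` is hypothetical, and it assumes that a separation point is an interior local minimum whose value lies below the threshold, with cluster ranges running between consecutive separation points.

```python
import numpy as np

def split_at_minima(deriv, threshold):
    """Return ([start, end] ranges, minima indices) for a 1D derivative array.

    Assumption: a separation point is an interior local minimum below
    `threshold`; the documented function may use a different criterion.
    """
    minima = [
        i for i in range(1, len(deriv) - 1)
        if deriv[i] < threshold
        and deriv[i] <= deriv[i - 1]
        and deriv[i] <= deriv[i + 1]
    ]
    # Ranges run from one separation point to the next, covering the array.
    edges = [0] + minima + [len(deriv)]
    ranges = [[edges[k], edges[k + 1]] for k in range(len(edges) - 1)]
    return ranges, minima

deriv = np.array([-0.1, -0.2, -5.0, -0.3, -0.1, -4.0, -0.2])
ranges, minima = split_at_minima(deriv, threshold=-1.0)
```

With this input the two sharp drops (indices 2 and 5) split the array into three ranges.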

aweSOM.make_sce_clusters.make_file_name(n: int, ext: str) → str

Make a filename based on the number and the extension given.

Parameters:
  • n (int) – number to be converted to a filename

  • ext (str) – file extension

Returns:

filename

Return type:

str

aweSOM.make_sce_clusters.parse_args()

Argument parser for the make_sce_clusters.py script.

aweSOM.make_sce_clusters.plot_gsum_deriv(gsum_deriv: numpy.ndarray, threshold: float, minimas: list[int] = None, file_path: str = None)

Plots the gsum derivative with optional minima highlighted.

Parameters:
  • gsum_deriv (np.ndarray) – An array of gsum derivative values to be plotted.

  • threshold (float) – The threshold value to draw a horizontal line on the plot.

  • minimas (list[int], optional) – A list of indices representing the minima to be highlighted on the plot. Defaults to None.

  • file_path (str, optional) – The file path where the plot will be saved. If None, the plot will be displayed instead. Defaults to None.

Returns:

This function does not return a value. It either displays the plot or saves it to a file.

Return type:

None

aweSOM.make_sce_clusters.plot_gsum_values(gsum_values: list[float], minimas: list[int] = None, file_path: str = None)

Plot the gsum values with optional minima markers.

Parameters:
  • gsum_values (list[float]) – A list of gsum values to plot.

  • minimas (list[int], optional) – A list of indices indicating the minima to highlight. Defaults to None.

  • file_path (str, optional) – The directory path where the plot will be saved. If None, the plot will be displayed. Defaults to None.

Returns:

This function does not return a value. It either displays the plot or saves it to a file.

Return type:

None

aweSOM.run_som module

aweSOM.run_som.batch_separator(data: numpy.ndarray, number_of_batches: int) → numpy.ndarray

Split a dataset into a given number of batches, each containing the same number of data points.

Parameters:
  • data (np.ndarray) – N x f dataset, N is the number of data points and f is the number of features

  • number_of_batches (int) – number of batches to create (b)

Returns:

Array of batches with shape b x N//b x f

Return type:

np.ndarray
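
The b x N//b x f output shape can be illustrated with a short NumPy sketch. The helper name `split_into_batches` is hypothetical, and dropping the leftover rows when N is not divisible by b is an assumption about how the remainder is handled.

```python
import numpy as np

def split_into_batches(data, number_of_batches):
    """Split an N x f array into b equal batches of N // b rows each.

    Assumption: rows left over when N is not divisible by b are dropped.
    """
    rows_per_batch = data.shape[0] // number_of_batches
    usable = data[: rows_per_batch * number_of_batches]
    return usable.reshape(number_of_batches, rows_per_batch, data.shape[1])

data = np.arange(14).reshape(7, 2)      # N = 7 data points, f = 2 features
batches = split_into_batches(data, 3)   # b = 3 batches of 7 // 3 = 2 rows
```

Here the seventh row does not fit into three equal batches and is left out.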

aweSOM.run_som.initialize_lattice(data: numpy.ndarray, ratio: float) → list[int]

Given an N x f dataset and a ratio, return the dimensions of the SOM lattice based on Kohonen’s advice.

Parameters:
  • data (np.ndarray) – N x f dataset, N is the number of data points and f is the number of features

  • ratio (float) – height to width ratio of the lattice, between 0 and 1.

Returns:

[xdim, ydim] dimensions of the lattice

Return type:

list[int]
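
A rough sketch of how lattice dimensions might follow from the dataset size and the ratio is given below. Everything specific in it is an assumption: the node count uses a 5 * sqrt(N * f) rule of thumb, ydim is taken as the height, and the values are truncated to integers. The helper name `lattice_dims` is hypothetical, and the library's exact formula and rounding may differ.

```python
import math

def lattice_dims(n_points, n_features, ratio):
    """Return [xdim, ydim] for a lattice with ydim / xdim close to `ratio`.

    Assumptions: node count = 5 * sqrt(N * f), ydim is the height, and
    both dimensions are truncated to integers.
    """
    nodes = 5 * math.sqrt(n_points * n_features)
    xdim = int(math.sqrt(nodes / ratio))   # xdim * (ratio * xdim) ~ nodes
    ydim = int(ratio * xdim)
    return [xdim, ydim]

dims = lattice_dims(n_points=10_000, n_features=4, ratio=0.5)
```

For this input the target is 1000 nodes, and the truncated lattice has 44 x 22 = 968 nodes.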

aweSOM.run_som.inv_manual_scaling(normed_data: numpy.ndarray, ori_data: numpy.ndarray, bulk_range: float = 1.0) → numpy.ndarray

Given data that has been scaled using manual_scaling, return the data in its original units.

Parameters:
  • normed_data (np.ndarray) – 2d array of data (M x f)

  • ori_data (np.ndarray) – 2d array of original data (N x f)

  • bulk_range (float, optional) – The range within which 95% of the data lies. Defaults to 1.0.

Returns:

unscaled data

Return type:

np.ndarray

aweSOM.run_som.manual_scaling(data: numpy.ndarray, bulk_range: float = 1.0) → numpy.ndarray

Scale data to a range that is centered on 0 and contains 95% of the data.

Parameters:
  • data (np.ndarray) – 2d array of data (N x f)

  • bulk_range (float, optional) – The range within which 95% of the data lies. Defaults to 1.0.

Returns:

scaled data

Return type:

np.ndarray
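
The scaling and its inverse can be sketched as a pair of functions. This is an illustration under stated assumptions, not the library's code: the mean is used as the center, two standard deviations are taken as the 95% bulk (accurate for roughly Gaussian features), and the names `scale_bulk` and `unscale_bulk` are hypothetical. It also shows why the inverse needs the original data: the center and width are recomputed from it.

```python
import numpy as np

def scale_bulk(data, bulk_range=1.0):
    """Center each feature on 0 with ~95% of values in [-bulk_range, bulk_range].

    Assumption: mean as center, 2 standard deviations as the 95% bulk.
    """
    mean = data.mean(axis=0)
    half_width = 2.0 * data.std(axis=0)
    return bulk_range * (data - mean) / half_width

def unscale_bulk(normed, ori_data, bulk_range=1.0):
    """Invert scale_bulk using statistics of the original N x f data."""
    mean = ori_data.mean(axis=0)
    half_width = 2.0 * ori_data.std(axis=0)
    return normed * half_width / bulk_range + mean

rng = np.random.default_rng(0)
data = rng.normal(loc=[5.0, -3.0], scale=[2.0, 0.5], size=(1000, 2))
scaled = scale_bulk(data)
restored = unscale_bulk(scaled, data)   # round trip back to original units
```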

aweSOM.run_som.number_of_nodes(N: int, f: int) → int

Given a dataset with N data points and f features, return the number of nodes in the SOM lattice.

Parameters:
  • N (int) – number of data points

  • f (int) – number of features

Returns:

number of nodes in the lattice

Return type:

int

aweSOM.run_som.parse_args()

CLI argument parser for run_som.py script.

aweSOM.run_som.save_cluster_labels(som_labels: numpy.ndarray, xdim: int, ydim: int, alpha_0: float, train: int, batch: int = 1, initial: str = 's', name_of_dataset: str = '')

Saves the cluster labels to a numpy file.

Parameters:
  • som_labels (np.ndarray) – The cluster labels to be saved.

  • xdim (int) – The x-dimension of the SOM grid.

  • ydim (int) – The y-dimension of the SOM grid.

  • alpha_0 (float) – The initial learning rate.

  • train (int) – The number of training iterations.

  • batch (int, optional) – The batch size. Defaults to 1.

  • initial (str, optional) – The type of initialization. Defaults to “s”.

  • name_of_dataset (str, optional) – The name of the dataset. Defaults to “”.

aweSOM.run_som.save_som_object(som: Lattice, xdim: int, ydim: int, alpha_0: float, train: int, batch: int = 1, initial: str = 's', name_of_dataset: str = '')

Save the SOM object to a pickle file.

Parameters:
  • som (aweSOM.Lattice) – The SOM object to be saved.

  • xdim (int) – The x-dimension of the SOM lattice.

  • ydim (int) – The y-dimension of the SOM lattice.

  • alpha_0 (float) – The initial learning rate of the SOM.

  • train (int) – The number of training iterations.

  • batch (int, optional) – The batch size for training. Defaults to 1.

  • initial (str, optional) – The type of initial weights. Defaults to “s”.

  • name_of_dataset (str, optional) – The name of the dataset. Defaults to “”.

aweSOM.sce module

aweSOM.sce.compute_SQ(mask: numpy.ndarray, maskC: numpy.ndarray)

Compute the quality index between two masks.

Parameters:
  • mask ((j)np.ndarray) – mask of cluster C

  • maskC ((j)np.ndarray) – mask of cluster C’

Returns:

A tuple containing:
  • SQ (float): quality index, equal to S/Q.

  • SQ_matrix ((j)np.ndarray): pixelwise quality index, equal to S/Q * mask.

Return type:

tuple

aweSOM.sce.conditional_jit(func)

aweSOM.sce.create_mask(img: numpy.ndarray, cid: int) → numpy.ndarray

Create a mask for a given cluster id.

Parameters:
  • img ((j)np.ndarray) – 3D array of cluster ids

  • cid (int) – cluster id to mask

Returns:

masked cluster, 1 where cluster id is cid, 0 elsewhere

Return type:

(j)np.ndarray
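
The masking rule (1 where the cluster id equals cid, 0 elsewhere) can be shown with plain NumPy. The helper name `make_mask` is hypothetical, and the sketch uses NumPy only, whereas the documented function may also operate on JAX arrays.

```python
import numpy as np

def make_mask(img, cid):
    """Return an integer mask: 1 where img equals cid, 0 elsewhere."""
    return np.where(img == cid, 1, 0)

# A 2 x 2 x 2 volume of cluster ids.
img = np.array([[[0, 1], [2, 1]],
                [[1, 1], [0, 2]]])
mask = make_mask(img, 1)
```

The mask keeps the shape of `img`, so masks from different runs can be multiplied or summed cell by cell.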

aweSOM.sce.find_number_of_clusters(cluster_files: list[str]) → numpy.ndarray

Find the number of clusters in each run.

Parameters:

cluster_files (list[str]) – A list of data files saved in ‘.npy’ format.

Returns:

An array of the number of cluster ids in each run.

Return type:

(j)np.ndarray
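
As an illustration, counting cluster ids per run could look like the sketch below. It assumes that each '.npy' file holds an array of cluster ids and that the number of clusters is the number of distinct ids; the helper name `count_clusters` and the file names are hypothetical.

```python
import os
import tempfile

import numpy as np

def count_clusters(cluster_files):
    """Count the distinct cluster ids stored in each '.npy' file.

    Assumption: number of clusters = number of unique ids in the file.
    """
    return np.array([len(np.unique(np.load(path))) for path in cluster_files])

# Write two small label arrays to temporary files, then count their ids.
with tempfile.TemporaryDirectory() as tmp:
    runs = {
        "run_a.npy": np.array([0, 1, 1, 2]),
        "run_b.npy": np.array([0, 0, 3, 3]),
    }
    paths = []
    for name, labels in runs.items():
        path = os.path.join(tmp, name)
        np.save(path, labels)
        paths.append(path)
    counts = count_clusters(paths)
```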

aweSOM.sce.load_som_npy(path: str) → numpy.ndarray

aweSOM.sce.loop_over_all_clusters(all_files: list[str], number_of_clusters: numpy.ndarray, dimensions: numpy.ndarray, subfolder: str = 'SCE') → int

Loop over all clusters in the given data, compute the goodness-of-fit, then save the Gsum values to file.

Parameters:
  • all_files (list[str]) – A list of data files saved in ‘.npy’ format.

  • number_of_clusters ((j)np.ndarray) – An array of the number of cluster ids in each run.

  • dimensions (np.ndarray) – A 1d array representing the dimensions of the clusters (any dimensionality is allowed, but nx*ny*nz must equal the number of data points).

  • subfolder (str, optional) – The name of the subfolder to save the results to. Defaults to “SCE”.

Returns:

Save Gsum value of each cluster C to a file.

aweSOM.sce.parse_args()

Argument parser for the sce.py script.

aweSOM.som module

class aweSOM.som.Lattice(xdim: int = 10, ydim: int = 10, alpha_0: float = 0.3, train: int = 1000, alpha_type: str = 'decay', sampling_type: str = 'sampling')

Bases: object

static Gamma(index_bmu: int, m2Ds: numpy.ndarray, alpha: float, nsize: int, gaussian: bool = True)

Calculate the neighborhood function for a given BMU on a lattice.

Parameters:
  • index_bmu (int) – The index of the BMU node on the lattice.

  • m2Ds (np.ndarray) – Lattice coordinate of each node.

  • alpha (float) – The amplitude parameter for the Gaussian function, AKA the learning rate.

  • nsize (int) – The size of the neighborhood.

  • gaussian (bool, optional) – Whether to use Gaussian function or not. Defaults to True.

Returns:

The neighborhood function values for each node on the grid.

Return type:

np.ndarray
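
A neighborhood function of this kind can be sketched as follows. The functional forms are assumptions: a Gaussian of width nsize with amplitude alpha when gaussian is True, and a flat "bubble" of radius nsize otherwise. The helper name `neighborhood` is hypothetical and the library's exact expression may differ.

```python
import numpy as np

def neighborhood(index_bmu, m2Ds, alpha, nsize, gaussian=True):
    """Neighborhood values for every node, centered on the BMU.

    Assumptions: Gaussian alpha * exp(-d^2 / (2 * nsize^2)), or a flat
    bubble (alpha inside radius nsize, 0 outside) when gaussian is False.
    """
    dist_sq = np.sum((m2Ds - m2Ds[index_bmu]) ** 2, axis=1)
    if gaussian:
        return alpha * np.exp(-dist_sq / (2.0 * nsize ** 2))
    return np.where(dist_sq <= nsize ** 2, alpha, 0.0)

# Lattice coordinates of a 3 x 3 grid; node 4 is the center at (1, 1).
xs, ys = np.meshgrid(np.arange(3), np.arange(3))
m2Ds = np.column_stack([xs.ravel(), ys.ravel()])
gamma = neighborhood(index_bmu=4, m2Ds=m2Ds, alpha=0.3, nsize=1)
bubble = neighborhood(index_bmu=4, m2Ds=m2Ds, alpha=0.3, nsize=1, gaussian=False)
```

In both variants the BMU itself receives the full learning rate, and the update strength falls off with lattice distance.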

static assign_cluster_to_data(projection_2d: numpy.ndarray, clusters_on_lattice: numpy.ndarray) → numpy.ndarray

Given a lattice and cluster assignments on that lattice, return the cluster ids of the data (in a 1d array).

Parameters:
  • projection_2d (np.ndarray) – 2d array with x-y coordinates of the node associated with each data point

  • clusters_on_lattice (np.ndarray) – X x Y matrix of cluster labels on lattice

Returns:

cluster_id of each data point

Return type:

np.ndarray
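
The lookup amounts to indexing the lattice label matrix with each data point's node coordinates. A minimal sketch, assuming the first column of projection_2d is the x index and the second is the y index (the helper name `labels_from_projection` is hypothetical):

```python
import numpy as np

def labels_from_projection(projection_2d, clusters_on_lattice):
    """Look up the cluster label of the node each data point maps to.

    Assumption: column 0 of projection_2d indexes X, column 1 indexes Y.
    """
    x = projection_2d[:, 0].astype(int)
    y = projection_2d[:, 1].astype(int)
    return clusters_on_lattice[x, y]

clusters_on_lattice = np.array([[0, 0, 1],
                                [2, 2, 1]])       # X = 2, Y = 3
projection_2d = np.array([[0, 2], [1, 0], [1, 2], [0, 0]])
labels = labels_from_projection(projection_2d, clusters_on_lattice)
```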

assign_cluster_to_lattice(smoothing=None, merge_cost=0.0)

Assigns clusters to the lattice based on the computed centroids.

Parameters:
  • smoothing (float, optional) – Smoothing parameter for computing Umatrix. Defaults to None.

  • merge_cost (float, optional) – Cost threshold for merging similar centroids. Defaults to 0.0.

Returns:

Array representing the assigned clusters for each lattice point.

Return type:

numpy.ndarray

static best_match(lattice: numpy.ndarray, obs: numpy.ndarray, full=False) → numpy.ndarray

Given input vectors obs[n, f] (where n is the number of observations and f is the number of features per observation), return the best-matching node for each observation.

Parameters:
  • lattice (np.ndarray) – weight values of the lattice

  • obs (np.ndarray) – observations (input vectors)

  • full (bool, optional) – indicate whether to return first and second best match. Defaults to False.

Returns:

The 1d index of the best-matched node (within the lattice) for each observation.

Return type:

np.ndarray
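
A best-matching-unit search can be sketched as a nearest-neighbor lookup. The sketch assumes Euclidean distance and a lattice stored as a (nodes x f) weight array; the helper name `nearest_node` is hypothetical, and returning the two closest nodes for full=True is an interpretation of "first and second best match".

```python
import numpy as np

def nearest_node(lattice, obs, full=False):
    """Index of the closest node for each observation.

    Assumptions: lattice is (nodes, f), obs is (n, f), distance is Euclidean.
    With full=True, return the best and second-best node per observation.
    """
    diff = obs[:, None, :] - lattice[None, :, :]   # (n, nodes, f)
    dist_sq = np.sum(diff ** 2, axis=2)            # (n, nodes)
    if full:
        return np.argsort(dist_sq, axis=1)[:, :2]
    return np.argmin(dist_sq, axis=1)

lattice = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
obs = np.array([[0.9, 1.2], [3.0, 3.5], [-0.2, 0.1]])
bmu = nearest_node(lattice, obs)
bmu_pairs = nearest_node(lattice, obs, full=True)
```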

compute_centroids(explicit=False)

Compute the centroid for each node in the lattice given a precomputed Umatrix.

Parameters:

explicit (bool) – Controls the shape of the connected component.

Returns:

A dictionary containing the matrices with the same x-y dimensions as the original map, containing the centroid x-y coordinates.

Return type:

dict

compute_heat(d, smoothing=None)

Compute a heat value map representation of the given distance matrix.

Parameters:
  • d (numpy.ndarray) – A distance matrix computed via the ‘dist’ function.

  • smoothing (float, optional) – A positive floating point value controlling the smoothing of the umat representation. Defaults to None.

Returns:

A matrix with the same x-y dimensions as the original map containing the heat.

Return type:

numpy.ndarray

compute_umat(smoothing=None)

Compute the unified distance matrix.

Parameters:

smoothing (float, optional) – A positive floating point value controlling the smoothing of the umat representation. Defaults to None.

Returns:

A matrix with the same x-y dimensions as the original map containing the umat values.

Return type:

numpy.ndarray
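
For orientation, a bare-bones unified distance matrix can be computed as the mean distance between each node's weight vector and those of its lattice neighbors. The sketch below assumes a 4-neighborhood, Euclidean distance, weights stored as an (xdim, ydim, f) array, and no smoothing; the helper name `umatrix` is hypothetical and the library's method may differ on all of these points.

```python
import numpy as np

def umatrix(weights):
    """Mean Euclidean distance from each node to its 4-connected neighbors.

    Assumptions: weights has shape (xdim, ydim, f); no smoothing is applied.
    """
    xdim, ydim, _ = weights.shape
    umat = np.zeros((xdim, ydim))
    for x in range(xdim):
        for y in range(ydim):
            dists = [
                np.linalg.norm(weights[x, y] - weights[nx, ny])
                for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                if 0 <= nx < xdim and 0 <= ny < ydim
            ]
            umat[x, y] = np.mean(dists)
    return umat

# A 1 x 3 lattice with one feature: two similar nodes and one distant node.
weights = np.array([[[0.0], [0.0], [3.0]]])
umat = umatrix(weights)
```

High values mark nodes that sit on a boundary between dissimilar regions of the lattice.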

static coordinate(rowix: numpy.ndarray, xdim: int) → numpy.ndarray

Convert an array of row indices to an array of xy-coordinates.

Parameters:
  • rowix (np.ndarray) – 1d array with the 1d indices of the points of interest (n x 1 matrix)

  • xdim (int) – x dimension of the lattice

Returns:

array with x and y coordinates of each point in rowix

Return type:

np.ndarray
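
The conversion between row indices and xy-coordinates can be sketched as below. The index convention (row index = x + y * xdim) is an assumption, as are the helper names `to_xy` and `to_rowix`; the library may order the axes differently.

```python
import numpy as np

def to_xy(rowix, xdim):
    """Convert 1d node indices to (x, y) pairs.

    Assumption: row index = x + y * xdim.
    """
    rowix = np.asarray(rowix)
    return np.column_stack([rowix % xdim, rowix // xdim])

def to_rowix(x, y, xdim):
    """Inverse of to_xy under the same index convention."""
    return x + y * xdim

coords = to_xy(np.array([0, 3, 7]), xdim=4)
back = to_rowix(coords[:, 0], coords[:, 1], xdim=4)   # round trip
```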

fast_som()

Performs the self-organizing map (SOM) training.

This method initializes the SOM with random values or a subset of the data, and then trains the SOM by updating the node weights based on the input vectors. The training process includes adjusting the learning rate, shrinking the neighborhood size, and saving the node weights and U-matrix periodically.

get_unique_centroids(centroids)

Print out a list of unique centroids given a matrix of centroid locations.

Parameters:

centroids – A matrix of the centroid locations in the map.

Returns:

A dictionary containing the unique x and y positions of the centroids. The dictionary has the following keys: position_x: A list of unique x positions. position_y: A list of unique y positions.

list_clusters(centroids, unique_centroids)

Get the clusters as a list of lists. This is rarely useful in practice.

Parameters:
  • centroids (matrix) – A matrix of the centroid locations in the map.

  • unique_centroids (list) – A list of unique centroid locations.

Returns:

A list of clusters associated with each unique centroid.

Return type:

list

list_from_centroid(x, y, centroids)

Get all cluster elements associated with one centroid.

Parameters:
  • x (int) – The x position of a centroid.

  • y (int) – The y position of a centroid.

  • centroids (numpy.ndarray) – A matrix of the centroid locations in the map.

Returns:

A list of cluster elements associated with the given centroid.

Return type:

list

map_data_to_lattice()

After training, map each data point to the nearest node in the lattice.

Returns:

A 2D array with the x and y coordinates of the best matching nodes for each data point.

Return type:

np.ndarray[int]

merge_similar_centroids(naive_centroids: numpy.ndarray, threshold=0.3)

Merge centroids that are close enough together.

Parameters:
  • naive_centroids (np.ndarray) – original centroids before merging

  • threshold (float, optional) – Any centroids with a pairwise cost less than this threshold are merged. Defaults to 0.3.

Returns:

new node map with combined centroids

Return type:

np.ndarray

node_weight(x, y)

Returns the weight values of a node at (x,y) on the lattice.

Parameters:
  • x (int) – x-coordinate of the node.

  • y (int) – y-coordinate of the node.

Returns:

1d array of weight values of said node.

Return type:

np.ndarray

plot_heat(heat, explicit=False, comp=True, merge=False, merge_cost=0.001)

Plot the heat map of the given data.

Parameters:
  • heat (array-like) – The data to be plotted.

  • explicit (bool, optional) – A flag indicating whether the connected components are explicit. Defaults to False.

  • comp (bool, optional) – A flag indicating whether to plot the connected components. Defaults to True.

  • merge (bool, optional) – A flag indicating whether to merge the connected components. Defaults to False.

  • merge_cost (float, optional) – The threshold for merging the connected components. Defaults to 0.001.

static replace_value(centroids: dict[str, numpy.ndarray], centroid_a: tuple, centroid_b: tuple) → dict[str, numpy.ndarray]

Replaces the values of centroid_a with the values of centroid_b in the given centroids dictionary.

Parameters:
  • centroids (dict[str, np.ndarray]) – A dictionary containing the centroids.

  • centroid_a (tuple) – The coordinates of the centroid to be replaced.

  • centroid_b (tuple) – The coordinates of the centroid to replace with.

Returns:

The updated centroids dictionary.

Return type:

dict[str, np.ndarray]
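
The replacement can be pictured with two coordinate matrices, one for the x position and one for the y position of the centroid assigned to each node. The dictionary keys "centroid_x" and "centroid_y" and the helper name `replace_centroid` are hypothetical; the documentation does not list the actual keys.

```python
import numpy as np

def replace_centroid(centroids, centroid_a, centroid_b):
    """Point every node assigned to centroid_a at centroid_b instead.

    Assumption: the dictionary stores per-node centroid coordinates under
    the (hypothetical) keys "centroid_x" and "centroid_y".
    """
    xs, ys = centroids["centroid_x"], centroids["centroid_y"]
    hit = (xs == centroid_a[0]) & (ys == centroid_a[1])
    return {
        "centroid_x": np.where(hit, centroid_b[0], xs),
        "centroid_y": np.where(hit, centroid_b[1], ys),
    }

# 2 x 2 lattice: the first row points at centroid (0, 0), the second at (1, 1).
centroids = {
    "centroid_x": np.array([[0, 0], [1, 1]]),
    "centroid_y": np.array([[0, 0], [1, 1]]),
}
merged = replace_centroid(centroids, centroid_a=(1, 1), centroid_b=(0, 0))
```

After the call every node points at centroid (0, 0), which is the effect a merge of two centroids would have.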

rowix(x, y)

Convert from an xy-coordinate to a row index.

Parameters:
  • x (int) – The x-coordinate of the map.

  • y (int) – The y-coordinate of the map.

Returns:

The row index corresponding to the given xy-coordinate.

Return type:

int

smooth_2d(Y, ind=None, weight_obj=None, grid=None, nrow=64, ncol=64, surface=True, theta=None)

Smooth 2D data using a kernel smoother. This is an internal function with no user-facing aspect.

Parameters:
  • Y (array-like) – The input data to be smoothed.

  • ind (array-like, optional) – The indices of the data to be smoothed. Defaults to None.

  • weight_obj (dict, optional) – The weight object used for smoothing. Defaults to None.

  • grid (dict, optional) – The grid object used for smoothing. Defaults to None.

  • nrow (int, optional) – The number of rows in the grid. Defaults to 64.

  • ncol (int, optional) – The number of columns in the grid. Defaults to 64.

  • surface (bool, optional) – Flag indicating whether the data represents a surface. Defaults to True.

  • theta (float, optional) – The theta value used in the exponential covariance function. Defaults to None.

Returns:

The smoothed data.

Return type:

array-like

Raises:

None

train_lattice(data: numpy.ndarray, features_names: list[str], labels: numpy.ndarray = None, number_of_steps: int = -1, save_lattice: bool = False, restart_lattice: numpy.ndarray = None)

Train the model with numba JIT acceleration.

Parameters:
  • data (np.ndarray) – A numpy 2D array where each row contains an unlabeled training instance.

  • features_names (list[str]) – A list of feature names.

  • labels (np.ndarray, optional) – A vector with one label (ground truth) for each observation in data. Defaults to None.

  • number_of_steps (int, optional) – Number of steps taken this batch, used for keeping track of training restarts. Defaults to self.train.

  • save_lattice (bool, optional) – A flag that determines whether the node weights are saved to a file at the end of training. Defaults to False.

  • restart_lattice (np.ndarray, optional) – Vectors for the weights of the nodes from past realizations. Defaults to None.

Module contents