Column-level profile information
Each profile contains several levels of information.
The information is grouped as follows:
- Statistics
- Data classes
- Formats
- Types
When the results of advanced profiling are written to an output table, values are stored as strings irrespective of the actual data type. In that case, string sort order is applied when you sort the data classes, formats, or types.
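The effect of string sort order can be illustrated with a minimal sketch. The sample values are hypothetical and only stand in for numbers that were written to an output table as strings:

```python
# Hypothetical values as they might be read back from an output table,
# where every value is stored as a string.
stored_counts = ["9", "10", "100", "25"]

# Sorting the raw strings compares them character by character.
string_order = sorted(stored_counts)

# Converting to integers first restores numeric order.
numeric_order = sorted(stored_counts, key=int)

print(string_order)   # ['10', '100', '25', '9']
print(numeric_order)  # ['9', '10', '25', '100']
```

If you need numeric or date ordering, convert the stored strings back to the intended type before you sort.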
Statistics
The Statistics tab provides a summary of the structure of the analyzed data in a column and different types of visualizations for that structural information. Exactly which information is shown depends on whether the column contains continuous (quantitative) or nominal (qualitative) data.
Charts
Depending on the type of data in a column, you can choose between different types of visualizations:
- Nominal data:
  - Bar chart
  - Proportion or pie chart
  - Pareto chart
- Continuous data:
  - Histogram chart
  - Box plot chart
  - Quantile-quantile (Q-Q) plot chart
A distribution chart is available for all types of data. The distribution table usually lists at least the most frequent values (or intervals) in the column and their counts. The table might show other information such as the formats, types, or data classes. To view the individual rows that contain a certain value, click Show rows.
On the bar or histogram charts, you can select an overlay column to see how its values are distributed within each value of the column that you are currently looking at. For example, if you have a column with sold baked goods and select an overlay column season, you can see how sales of a certain bakery product differ per season. For the overlay column, you can pick from all columns in the data asset that contain nominal data.
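Conceptually, an overlay is a cross-tabulation of two columns. The following sketch shows that idea only. The rows and the column names `product` and `season` are hypothetical, and the code does not reproduce how the product builds its charts:

```python
from collections import Counter, defaultdict

# Hypothetical rows: (product, season) pairs from a sales data asset.
rows = [
    ("croissant", "summer"),
    ("croissant", "winter"),
    ("croissant", "winter"),
    ("baguette", "summer"),
    ("baguette", "summer"),
    ("stollen", "winter"),
]

# For each value of the profiled column, count the overlay column's values.
overlay = defaultdict(Counter)
for product, season in rows:
    overlay[product][season] += 1

for product, seasons in sorted(overlay.items()):
    print(product, dict(seasons))
```

Each bar for a product would then be split according to the per-season counts.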
Summary
The Summary tile provides general information about the data in the selected column:
- The data type of the column as defined in the data source
- The data type that was inferred through analysis
- The number of different data formats in that column
- The most frequent inferred format for that column
- The assigned data class
- The type of data measurement (nominal or continuous)
- The number of rows (that is, the number of values) that were checked
Basic statistics
Basic statistics provide general information about the distribution and dispersion of the values in the selected column. Depending on a column’s data type, the statistics vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values while statistics for a column of data type string have minimum length, maximum length, and mean length values.
Measure | Description | Shown for this type of data |
---|---|---|
Cardinality | The percentage of distinct values in the column, including blanks and nulls. It is calculated by dividing the total number of distinct values in a column by the total number of values in that column. | Continuous |
Distinct | The number of different values that exist in the sampled data for the column. | Continuous |
Entropy | This value quantifies how much information the column holds. More generally, entropy can be used to quantify the information in an event and a random variable. This amount is estimated not only based on the number of different values that are present in the variable but also by the amount of unexpected values. | Nominal |
Gini | The probability that a randomly chosen element is incorrectly classified. This measure is a variation of the Gini coefficient. The Gini index can vary from 0 to 1, where 0 indicates that all the elements belong to a certain class or that only one class exists. A Gini index of 1 indicates that all elements are randomly distributed across various classes. A value of 0.5 indicates that the elements are uniformly distributed across some classes. | Nominal |
Maximum | The largest value of a numeric variable. | Continuous |
Mean | The arithmetic average, the sum divided by the number of values. | Continuous |
Median | The value above and below which half of the values fall. If there is an even number of values, the median is the average of the two middle values when they are sorted. The median is not affected by outliers. | Continuous |
Minimum | The smallest value of a numeric variable. | Continuous |
Missing | The number of rows in the sample that don't have a value. | Continuous Nominal |
Mode | The most frequently occurring value in the column. If several values occur with equal frequency, each of them is a mode. | Continuous Nominal |
Outliers | The number of values in the column data that are far away from most other values in the column. | Continuous |
Range | The difference between the maximum and minimum values in the column. | Continuous |
Sum | The sum or total of the values, across all rows that have values. | Continuous |
Unique | The number of distinct values that appear only once in the current column. | Continuous Nominal |
Valid | The number of values that are considered valid, which means empty or missing column values are excluded. | Continuous Nominal |
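Several of the measures in the table can be approximated with a few lines of code. The following is a minimal sketch under stated assumptions, not the product's implementation: the function name `basic_statistics` is hypothetical, `None` stands for a missing value, entropy is Shannon entropy in bits, and Gini is the standard Gini impurity (1 minus the sum of squared value proportions), which might be normalized differently from the value that the profile reports.

```python
import math
from collections import Counter

def basic_statistics(values):
    """Approximate a few profile measures; None marks a missing value.

    Assumes that at least one value is not missing.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    proportions = [c / total for c in counts.values()]
    top = max(counts.values())
    return {
        "missing": len(values) - total,
        "valid": total,
        # Number of different non-missing values.
        "distinct": len(counts),
        # Values that appear exactly once.
        "unique": sum(1 for c in counts.values() if c == 1),
        # Every value that shares the highest frequency is a mode.
        "modes": sorted(v for v, c in counts.items() if c == top),
        # Distinct values, including the missing marker, as a percentage.
        "cardinality": 100.0 * len(set(values)) / len(values),
        # Shannon entropy in bits (assumed definition).
        "entropy": -sum(p * math.log2(p) for p in proportions),
        # Gini impurity (assumed definition).
        "gini": 1 - sum(p * p for p in proportions),
    }

sample = ["a", "a", "b", "c", None, "a"]
profile = basic_statistics(sample)
for measure, value in profile.items():
    print(measure, value)
```

For this sample, three of the five non-missing values are `a`, so `a` is the only mode, and `b` and `c` are the two unique values.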
Advanced insights
In-depth information about the distribution and dispersion of the values in the selected column. This information is shown only for continuous data:
Measure | Description |
---|---|
25th percentile | The value below which 25% and above which 75% of the detected values fall. |
75th percentile | The value above which 25% and below which 75% of the detected values fall. |
Kurtosis | A measure of the extent to which there are outliers (tailedness of a distribution). Excess kurtosis is the tailedness of a distribution relative to a normal distribution. For a normal distribution, the value of the kurtosis statistic is zero. Positive kurtosis indicates that the data exhibit more extreme outliers than a normal distribution. Negative kurtosis indicates that the data exhibit less extreme outliers than a normal distribution. Distributions with medium kurtosis (medium tails) are mesokurtic. Distributions with low kurtosis (thin tails) are platykurtic. |
Mean std. error | A measure of how far the sample mean (average) of the data is likely to be from the true population mean. |
Std. deviation | A measure of dispersion around the mean. With a low standard deviation, values are usually close to the mean. With a high standard deviation, the range of values is wider. |
Skewness | A measure of the asymmetry of a distribution. A distribution is asymmetrical when its left and right sides are not mirror images. A distribution can have right (or positive), left (or negative), or zero skewness (symmetric distribution). |
Variance | A measure of dispersion around the mean. It's the expectation of the squared deviation of a random variable from its population mean or sample mean. |
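The measures in this table follow standard statistical definitions, which the sketch below illustrates. It is an approximation under assumptions: the function name `advanced_insights` is hypothetical, variance and standard deviation use the sample (n - 1) form, skewness and excess kurtosis use population moments, and percentiles use the inclusive method of Python's `statistics.quantiles`. The product might use different estimators, so results can differ slightly.

```python
import math
import statistics

def advanced_insights(values):
    """Approximate dispersion and shape measures for numeric values.

    Assumes at least two values that are not all identical.
    """
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.variance(values)  # sample variance (n - 1)
    std_dev = math.sqrt(variance)
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Population central moments, used for the shape measures.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "25th_percentile": q1,
        "75th_percentile": q3,
        "variance": variance,
        "std_deviation": std_dev,
        "mean_std_error": std_dev / math.sqrt(n),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }

insights = advanced_insights([1, 2, 3, 4, 5, 6, 7, 8, 9])
for measure, value in insights.items():
    print(measure, value)
```

The sample is symmetric and evenly spread, so the skewness is zero and the excess kurtosis is negative (thin tails).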
Data classes
The following information is shown for data class assignments:
- The selected data class, which is the data class assigned to the column. It is the same as the detected data class unless you manually changed it.
- The detected data class, which is the best matching data class for the column as detected by the analysis.
- The confidence score of the assigned data class. The confidence of a data class is the percentage of nonnull values that match the data class. Several data classes are more generic identifiers that are detected and assigned at a column level. These data classes are assigned when a more specific data class could not be identified at a value level. Generic identifiers will always have a confidence of 100% and include the following data classes: Code, Identifier, Indicator, Quantity, and Text.
- A list of all data classes that were detected during analysis in descending order, with the best match (the highest confidence) at the top. For each data class, the confidence score and the data class priority are shown.
- For each detected data class, additional information might be shown depending on the scope of the data class.
  For data classes where the matching is done based on column data, column values that matched the criteria for this specific data class are listed. The Count (%) column shows how many rows in the sample contain a specific value and the percentage of rows with that value. In addition, the format of each matching value is shown.
  For data classes where the matching is done based on the column name and for the generic data classes Code, Identifier, Indicator, Quantity, and Text, no additional information is shown. These data classes are used when the data values don't allow for identifying a specific data class. The generic data classes always have a confidence of 100%.
For more information, see Data classes.
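The confidence score described in this section, the percentage of nonnull values that match a data class, can be sketched as follows. The email pattern and the function name `confidence` are hypothetical stand-ins for illustration. Actual data class matching is defined in the product and is not reproduced here.

```python
import re

# Hypothetical matcher for illustration only; it is not the product's
# definition of any data class.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def confidence(values, matcher):
    """Return the percentage of nonnull values that the matcher accepts."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    matched = sum(1 for v in non_null if matcher(v))
    return 100.0 * matched / len(non_null)

sample = ["ann@example.com", "bob@example.org", "n/a", None, "eve@example.net"]
score = confidence(sample, lambda v: EMAIL_PATTERN.fullmatch(v) is not None)
print(f"{score:.0f}%")  # 75%
```

The null value is excluded from the denominator, so three matches out of four nonnull values give 75%.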
Formats
The format inferred for the column, the number of detected formats, and a list of all detected formats are shown.
A format represents the character pattern of a data value. Every alphabetic character is represented by an uppercase or lowercase letter A, depending on the capitalization of the character. Every numeric character is represented by the number 9. Spaces and special characters are shown as they appear.
The list of detected formats shows how many values with a specific format were found and the overall percentage of values with that format. Click an entry to see the values that match the pattern. Only 100 values are retrieved for display, so the value list might not contain all values or might even be empty.
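The character-pattern rule described above can be sketched in a few lines. The function name `value_format` and the sample values are hypothetical. The sketch also assumes that letters without case are mapped to a lowercase `a`, which the description does not specify.

```python
from collections import Counter

def value_format(value):
    """Map a string to its character pattern: letters to A or a, digits to 9."""
    pattern = []
    for ch in value:
        if ch.isdigit():
            pattern.append("9")
        elif ch.isalpha():
            # Assumption: letters that have no case are treated as lowercase.
            pattern.append("A" if ch.isupper() else "a")
        else:
            # Spaces and special characters are kept as they appear.
            pattern.append(ch)
    return "".join(pattern)

values = ["AB-1234", "CD-5678", "ef-9012", "X 1"]
format_counts = Counter(value_format(v) for v in values)

for fmt, count in format_counts.most_common():
    print(fmt, count, f"{100 * count / len(values):.0f}%")
```

Here `AA-9999` is the most frequent format because two of the four values share it.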
Types
The following information is shown:
- The data type of the column as defined in the data source
- The data type that was inferred through analysis
- The minimum length of a value in that column
- The maximum length of a value in that column
- The average length of column values
- A list of all data types in the column
The data type describes whether the column contains data that is of a certain type, such as integer, string, or date type.
Typically, a column's optimal data type is obvious because most or all of the column values are of the same data type. However, when the list contains multiple different data types, check the frequency count for the inferred data type. If that frequency count is low relative to the row count of the table, invalid data values might cause the wrong data type to be inferred.
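The check described above, comparing the frequency of the inferred data type with the row count, can be illustrated with a simplified sketch. The function name `infer_value_type`, the type labels, and the parsing rules are assumptions for illustration. The product's own type inference is likely more elaborate.

```python
from collections import Counter
from datetime import date

def infer_value_type(text):
    """Classify one string value by trying progressively looser parsers."""
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        pass
    try:
        date.fromisoformat(text)
        return "date"
    except ValueError:
        return "string"

values = ["12", "7", "3.5", "2024-01-31", "abc", "42"]
type_counts = Counter(infer_value_type(v) for v in values)
inferred_type, frequency = type_counts.most_common(1)[0]
share = frequency / len(values)

print(inferred_type, frequency, f"{share:.0%}")  # integer 3 50%
```

A share of only 50% for the most frequent type is the kind of signal that suggests invalid values are present and that the inferred type deserves a closer look.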