Article

Flexible Techniques to Detect Typical Hidden Errors in Large Longitudinal Datasets

Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Roma, Italy
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(5), 529; https://doi.org/10.3390/sym16050529
Submission received: 27 February 2024 / Revised: 11 April 2024 / Accepted: 22 April 2024 / Published: 28 April 2024
(This article belongs to the Topic Decision-Making and Data Mining for Sustainable Computing)

Abstract
The increasing availability of longitudinal data (repeated numerical observations of the same units at different times) requires the development of flexible techniques to automatically detect errors in such data. Besides standard types of errors, which can be treated with generic error correction techniques, large longitudinal datasets may present specific problems not easily traceable by the generic techniques. In particular, after applying those generic techniques, time series in the data may contain trends, natural fluctuations and possible surviving errors. To study the data evolution, one main issue is distinguishing those elusive errors from the rest, which should be kept as they are and not flattened or altered. This work responds to this need by identifying some types of elusive errors and by proposing a statistical-mathematical approach to capture their complexity that can be applied after the above generic techniques. The proposed approach is based on a system of indicators and works at the formal level by studying the differences between consecutive values of data series and the symmetries and asymmetries of these differences. It operates regardless of the specific meaning of the data and is thus applicable in a variety of contexts. We implement this approach in a relevant database of European Higher Education institutions (ETER) by analyzing two key variables: “Total academic staff” and “Total number of enrolled students”, which are two of the most important variables, often used in empirical analysis as a proxy for size, and are considered by policymakers at the European level. The results are very promising.

1. Introduction

In the context of an increasingly data-driven economy, data quality is of paramount importance for organizations of all types and sizes, and a lack of attention to it can lead to significant costs and inefficiencies. According to the quality framework of the Organisation for Economic Cooperation and Development (OECD) [1], data quality is defined as the “fitness for use” with respect to user needs. Data quality can be viewed as an overarching principle that must be taken into account when designing models of metrics [2]. Every technique developed to improve data quality should consider that the very concept of data quality is not one-dimensional but multidimensional [3,4,5]. In particular, the following seven dimensions are usually identified: accuracy, completeness, consistency, validity, timeliness, uniqueness, and integrity.
Due to the relevance of the issue, many authors have proposed methods or guidelines to assess problems with data quality [6,7,8,9,10,11,12]. However, few works focus on the problems that specifically regard the case of numerical data describing repeated observations of the same units over a period of time. This type of data is often called longitudinal data or panel data. If we restrict our attention to one single unit over the whole time period, then we obtain a single time series. If, on the contrary, we consider all the different units but restrict our attention to one single time instant, then we obtain cross-sectional data. In recent years, longitudinal data have become more and more abundant, and researchers have been exploring the vast possibilities given by their study, typically by using advanced artificial intelligence techniques that are now able to deal with huge datasets. However, one ubiquitous problem affecting almost all data-related applications is the presence of errors in the data. Unfortunately, longitudinal data are no exception to this. Thus, when data containing errors are used for some studies, the results will contain a certain degree of unreliability. In other words, when data contain errors, the problem that we actually solve is different from the real problem that was to be solved.
The presence of errors in data may be due to several causes, and consequently, there exist many types of data errors. Easily identifiable cases are, for example, missing values or out-of-range values. For the rest of this work, we assume that the error location is unknown and that the original exact value is not available, since otherwise simple replacement operations would fix the issue. As this type of problem is very widespread, many techniques have been developed in different fields of science to cope with these situations. There exist several imputation techniques for the reconstruction of missing or out-of-range values; see for example [13,14]. Some methods for estimating measurement errors in longitudinal data are based on latent variable modeling [15,16]. Another technique, called the MultiTrait MultiError approach, is presented in [17] to estimate multiple types of errors concurrently using a combination of experimental design and latent variable modeling. Survey [18] identifies three main error types in time series: single-point big errors, single-point small errors and continuous errors. Continuous errors occur at several consecutive time points, whereas a single-point error occurs at an isolated data point, not at consecutive ones. A big error means that the observed value of the data point is far from the true value, as opposed to a small error. Single-point errors may often be identified by searching for outliers. An outlier is “An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [19]. Following surveys [18,20], methods to detect anomalies/outliers in time series can be grouped as follows: density-based methods like Local Outlier Factor [21]; threshold methods based on statistical descriptions like standard deviations or median absolute deviation, see for example [22]; statistical test methods like Grubbs’ test [23]; machine learning methods based on the use of a training set of outliers, like isolation forest [24]; model-based methods like moving average (MA), autoregression (AR) and their variations (like ARMA or ARIMA), see for example [25]. These methods are effective in cleaning a single time series but suffer from the following limitations: they are generally quite data-demanding (so they become unreliable when the series is not long enough) and they often require statistical assumptions on the data (e.g., they need a Gaussian distribution).
Finally, there exist methods based on data integration, when the data under analysis are also contained or derivable from different sources [26,27], and methods based on checksum to reconstruct information that was originally known but has been damaged. However, these categories are not applicable to our case, because the original exact value is neither known nor available from other sources.
In addition to the “usual” types of errors, large longitudinal datasets may contain peculiar types of errors that are not easily identifiable by the techniques used for finding and/or correcting errors in generic datasets. In the context of numerical longitudinal data obtained from several sources and assembled to form one database, the following situations can lead to some very typical errors:
(1)
When the time series of the different units are written/stored one next to the other, one or more values from one unit A may be erroneously inserted in the space allotted to a contiguous unit B, and vice versa, the corresponding values from B may be inserted in the space of A. We call this situation the inversion of values between units. This type of error is often not detectable by general error detection techniques. Moreover, even if the problem is detected, because for example a value is too high or too low for unit A, generic imputation techniques will probably try to reconstruct the correct value based on elaborations involving unit A, ignoring that the correct values are already stored in the database but in the space of record B. Several problems may arise if this type of error is not fully recognized.
(2)
Data contain one or more large “jumps” in the values of the time series corresponding to one unit. For example, given a unit A, imagine that the values of one of its variables are 100, 120, 280, 130, 120, 150. The third value is far from the others, so we may suspect some problem. However, depending on the degree of volatility, the situation can also be normal after all. We call this situation an anomalous jump. In this case, we need to identify a threshold above which the values should be considered erroneous. This is a very delicate issue, and standard error detection techniques may be insufficient in this case.
(3)
A time series is composed of values produced by a data provider (for example, an agent or an organization) at every given interval of time (for example, every year). In this case, it may happen that the data provider computes a value vt for a given time t and later discovers that vt was incorrect, because some units should have been added to vt but were not considered, so vt should actually be increased by δt, or because some units counted in vt actually belong to the next time interval, so vt should actually be decreased by δt. In this case, if it is too late to modify vt, the data provider often tries to compensate for the error by modifying the next value produced, vt+1, providing vt+1 + δt in the first case and vt+1 − δt in the second. We call this situation recalculation operated by the data provider. Clearly, this type of problem is hardly detectable by general error detection techniques, and again, several problems may arise if this type of error is not fully recognized.
Another limitation of the above-described error correction techniques is that they generally look at the single time series and do not consider the whole database structure in the search for potential errors, so they overlook errors that may arise from its tabular format, like the inversion and the recalculation problems. This work responds to this need by proposing a statistical-mathematical approach based on a system of indicators that define a rational process to assess and improve the quality of the data (as suggested by [28]). In particular, the proposed approach is able to identify suspect erroneous data suffering from the three described problems by working at the formal level, regardless of the specific meaning of the data. Therefore, it is applicable even after a generic error correction step in a variety of contexts. Note that we aim at pinpointing the described error situations without flattening or altering the rest of the data, as some noise removal techniques unfortunately do. Therefore, one delicate issue is distinguishing natural fluctuations from erroneous jumps. We pursue this aim by studying the symmetries and asymmetries of the differences (deltas) between consecutive values of the time series. Our approach also incorporates a certain degree of flexibility, because it is based on a number of mathematical conditions that can be slightly changed to adapt to different cases and take into account different realities.
We implement this approach in a large and relevant database of European Higher Education institutions (ETER) by analyzing the two key variables “Total academic staff” and “Total number of enrolled students”. These are two of the most important variables, often used in empirical analysis as a proxy for the size of the institutions, and are considered by policymakers at the European level.

2. Materials and Methods

As explained in Section 1, there exist several error correction techniques for large datasets. Nonetheless, in large numerical longitudinal databases, we identify the following three main consistency problems that are specific to the case of longitudinal data and are hardly treatable with standard error detection and correction techniques:
  • Inversion of values between units;
  • Anomalous jump;
  • Recalculation operated by the data provider.
The proposed methodology aims at identifying possible errors by raising check flags (which can later be examined by database managers) on suspect data. Our method can also be applied after other standard error correction procedures, and it consists of several steps for each of the above three problems, detailed in Section 2.1, Section 2.2 and Section 2.3. As materials, we conducted our experiments on the ETER database, described in Section 2.4.

2.1. Inversion Problem

To identify the inversion problem between two units A and B, we evaluate two types of conditions that we call here H1 and H2. The first type (H1) consists of assessing, for each possible pair of units A and B, whether there are possible systematic exchanges between the values of A and B over one or more time instants through the evaluation of the differences (called Δ) between each pair of temporally consecutive values of the same variable. In more detail, the generic condition H1 is evaluated by executing the following steps.
H1.a. 
Denote by i the index of the generic unit (a row in the dataset), with i = 1, …, m, and denote by U the set of all m units. The series of the values of a variable (or attribute) v for unit i over the time instants t = 1, …, n (whose set is denoted by S) is denoted by (vi1, …, vin). Then, define Δvi(t,t+1) as the difference (delta) between the two values assumed by unit i in two consecutive time instants t, t + 1 for variable v as follows:
$$\Delta v_i(t,t+1) = v_{it} - v_{i,t+1}$$
Those deltas are computed for each period of the dataset and for each unit (and for each variable if there is more than one variable in the dataset). Obviously, for the last period n, Δvi(n,n+1) is not computable. The generic value Δvi(t,t+1) can take on a negative or a positive value. We define as P the set of the indices t for which Δvi(t,t+1) is positive and as N the set of the same indices for which Δvi(t,t+1) is negative.
H1.b. 
Compute, for each unit i, the value DVi defined as the modulus of the product between the sum of the positive deltas and the sum of negative deltas as follows:
$$DV_i = \Bigl|\, \sum_{t \in P} \Delta v_i(t,t+1) \cdot \sum_{t \in N} \Delta v_i(t,t+1) \,\Bigr|$$
This is, in essence, a measure of the intrinsic variability of unit i: in practical cases, it captures the fact that some units “change” their values more than others. In case either of the two sums is equal to zero, its value is changed to 1, so that DVi does not collapse to zero for a unit that still exhibits some intrinsic variability. Note that this is one of the customizable aspects, depending on the practical case under study.
H1.c. 
Compute the DMi value for each unit i as the ratio between DVi and the arithmetic mean of all DVs in the entire dataset considered as follows:
$$DM_i = \frac{DV_i}{\frac{1}{m}\sum_{j \in U} DV_j}$$
This value represents a normalization of the above measure of intrinsic variability. The normalization should be conducted over some homogeneous set of units to which unit i belongs. Thus, depending on the context, such a homogeneous set must be identified. For example, in the case presented in Section 3, there is strong heterogeneity in data from different national contexts (i.e., different countries). For this reason, the DVi is averaged by the mean of DVi over the country to which the unit belongs.
H1.d. 
The numerical values of the above DMi may still vary greatly. To avoid numerical instability, we compress their scale by computing the cube root, obtaining values called RQi representing the compressed normalized intrinsic variability of the unit.
$$RQ_i = \sqrt[3]{DM_i}$$
H1.e. 
Compute the value GMi as the geometric mean of the moduli of all the deltas of unit i. This value represents an evaluation of the size of the unit. If some of the deltas are zero, then they can again be replaced with 1 to avoid the geometric mean collapsing to zero when this is not acceptable.
H1.f. 
Now, to compute a reasonable upper limit on the delta values that unit i could attain, we multiply the compressed normalized intrinsic variability by the measure of the size of the unit, obtaining the following threshold Ti:
$$T_i = GM_i \cdot RQ_i$$
H1.g. 
Now, to finally recognize the situation of the inversion of a value between two consecutive units A and B, condition H1 requires four conditions to be verified at the same time: unit A has two consecutive deltas larger (in modulus) than the threshold TA and with opposite signs (w.l.o.g., the first is positive and the second is negative), and unit B for the same time instants has again two consecutive deltas larger (in modulus) than the threshold TB but with signs reversed with respect to A (the first is negative and the second is positive). In practice, condition H1 is given by the following Boolean expression:
$$\begin{aligned}
H1(A,B)_t:\;\; \{ &[(\Delta v_A(t-1,t) > 0 \wedge \Delta v_A(t,t+1) < 0) \wedge (\Delta v_B(t-1,t) < 0 \wedge \Delta v_B(t,t+1) > 0)] \\
\vee\; &[(\Delta v_A(t-1,t) < 0 \wedge \Delta v_A(t,t+1) > 0) \wedge (\Delta v_B(t-1,t) > 0 \wedge \Delta v_B(t,t+1) < 0)] \} \\
\wedge\; &(|\Delta v_A(t-1,t)| > T_A \wedge |\Delta v_A(t,t+1)| > T_A \wedge |\Delta v_B(t-1,t)| > T_B \wedge |\Delta v_B(t,t+1)| > T_B)
\end{aligned}$$
If H1(A,B)t is true, then we also need a corresponding condition H2(A,B)t to be true to have a probable swap problem. The generic condition H2 is evaluated by the following steps.
H2.a. 
For each unit i, we define Iit as the distance of the value vit at time t from the mean value of v over time without the value at time t as follows:
$$I_{it} = v_{it} - \frac{\sum_{k \in S \setminus \{t\}} v_{ik}}{n-1}$$
H2.b. 
We define Nit as the distance of the value vit at time t from the analogous mean computed over the values of the subsequent unit i + 1 (the one with which the values could have been exchanged), again excluding time t, as follows:
$$N_{it} = v_{it} - \frac{\sum_{k \in S \setminus \{t\}} v_{i+1,k}}{n-1}$$
H2.c. 
Finally, we define Fit as the minimum between the moduli of the two above values:
$$F_{it} = \min(|I_{it}|, |N_{it}|)$$
In practice, we are comparing the distance between value vit and all the other values of unit i with the distance between vit and all the other values of unit i + 1. If vit is closer to the values of unit i + 1, then the minimum is |Nit| and an inversion is probable.
Hence, condition H2 for units A and B is evaluated as follows:
$$H2(A,B)_t:\; F_{it} \neq |I_{it}|$$
where unit i corresponds to A and unit i + 1 to B.
Conditions H1 and H2 are computed and checked for every pair of units A and B and every time instant t. If H1(A,B)t is true and H2(A,B)t is also true, a possible swapping error flag is raised for units A and B at time instant t; otherwise, no flag is raised. Note that this error may even affect more than one time instant of the same two units.
Example 1. We provide an example of the check for the inversion problem for two units (called unit 1 and unit 2) on a variable v of a longitudinal dataset with n = 5 time instants. The data and the deltas of the units are shown in Table 1. For instance, unit 1 has v12 = 18 and v13 = 130, hence Δv1(2,3) = 18 − 130 = −112. After this, DV is equal to |(−112 − 10)(107 + 10)| = 14,274 for unit 1 and |(−129)(120 + 5 + 5)| = 16,770 for unit 2. Subsequently, the value of the geometric mean GM is 33.09 for unit 1 and 24.94 for unit 2; DM is 0.92 for unit 1 and 1.08 for unit 2, and RQ is 0.97 for unit 1 and 1.03 for unit 2. Consequently, the threshold T is 32.17 for unit 1 and 25.59 for unit 2.
Then, we find H1(1,2)t. Considering that for unit 1, Δv1(1,2) > 0 and Δv1(2,3) < 0 and, for unit 2, Δv2(1,2) < 0 and Δv2(2,3) > 0, the first part of the H1 condition is verified. Additionally, all those Δv exceed the respective thresholds T. Therefore, H1(1,2)2 is true.
To evaluate H2(1,2)2, we compute I12 and N12 for unit 1 and time 2.
We have I12 = 18 − (125 + 130 + 120 + 130)/4 = −108.25 and N12 = 18 − (21 + 30 + 25 + 20)/4 = 18 − 24 = −6.
Since −6 has the smaller modulus, F12 = 6 ≠ |I12|, and H2(1,2)2 is true. As both conditions are true, a probable inversion error flag is reported for the period t = 2.
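To make steps H1.a–H1.g and H2.a–H2.c concrete, the following is a minimal Python sketch of the inversion check (an illustration with our own function names, not the authors' Excel implementation of [29]; the zero-replacement convention of step H1.b is included). Run on the data of Example 1, it reproduces the thresholds and the flag derived above.

```python
import math

# Data of Example 1 (Table 1): two units observed at t = 1..5.
units = {1: [125, 18, 130, 120, 130],
         2: [21, 150, 30, 25, 20]}

def deltas(series):
    # H1.a: delta between consecutive values, Delta_v_i(t,t+1) = v_it - v_i,t+1
    return [series[t] - series[t + 1] for t in range(len(series) - 1)]

def dv(ds):
    # H1.b: |sum of positive deltas * sum of negative deltas|, zero sums replaced by 1
    pos = sum(x for x in ds if x > 0) or 1
    neg = sum(x for x in ds if x < 0) or 1
    return abs(pos * neg)

def gm(ds):
    # H1.e: geometric mean of the moduli of the deltas (zero deltas replaced by 1)
    mods = [abs(x) or 1 for x in ds]
    return math.prod(mods) ** (1 / len(mods))

d = {i: deltas(v) for i, v in units.items()}
DV = {i: dv(ds) for i, ds in d.items()}
mean_dv = sum(DV.values()) / len(DV)                 # normalization set = these units
DM = {i: DV[i] / mean_dv for i in DV}                # H1.c: normalized variability
RQ = {i: DM[i] ** (1 / 3) for i in DM}               # H1.d: cube-root compression
T = {i: gm(d[i]) * RQ[i] for i in units}             # H1.f: threshold T_i

def h1(a, b, t):
    # H1.g: opposite sign patterns on both units, all four deltas above the thresholds;
    # t is the 0-based index of the suspected instant (t = 1 here is the paper's t = 2)
    da, db = d[a], d[b]
    signs = (da[t - 1] > 0 > da[t] and db[t - 1] < 0 < db[t]) or \
            (da[t - 1] < 0 < da[t] and db[t - 1] > 0 > db[t])
    sizes = (abs(da[t - 1]) > T[a] and abs(da[t]) > T[a] and
             abs(db[t - 1]) > T[b] and abs(db[t]) > T[b])
    return signs and sizes

def h2(a, b, t):
    # H2.a-H2.c: is v_at closer to the other values of unit b than to those of unit a?
    va, vb = units[a], units[b]
    n = len(va)
    i_at = va[t] - (sum(va) - va[t]) / (n - 1)        # I_at
    n_at = va[t] - (sum(vb) - vb[t]) / (n - 1)        # N_at
    return min(abs(i_at), abs(n_at)) != abs(i_at)     # F_at != |I_at|

print({i: round(x, 2) for i, x in T.items()})         # {1: 32.17, 2: 25.59}
print(h1(1, 2, 1) and h2(1, 2, 1))                    # True -> inversion flag at t = 2
```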

2.2. Anomalous Jump Problem

To identify anomalous jumps, we now compute for each unit i a threshold with tolerance, TTi, larger than before, obtained as follows. After the computation of the threshold Ti described in Section 2.1, we execute the following steps.
  • Calculate the value LGMi as the natural logarithm of the GMi value presented in Section 2.1. This logarithm represents a compressed measure of the size of the unit.
  • Compute VIi as the integer upper part of the value LGMi plus a constant c representing another element of the customization of the procedure. This value can be determined either with a priori reasoning or even derived from the data itself, for example, by defining a training set of anomalous/not anomalous jumps and by choosing the value of c maximizing the detection performance.
    VIi = ⌈LGMi + c
  • Compute GMTi as the sum GMi + Ti. In practice, we are summing the size and the threshold for unit i, obtaining a kind of deformation of the threshold by its size.
  • Finally, identify the threshold with tolerance TTi as the largest value between the two size-derived values described above. This is used as an upper bound on the reasonable jumps observed in the values of the unit.
    TTi = max(VIi, GMTi).
Now, an anomalous jump flag is raised for a unit i in a time t, t + 1 for variable v if the modulus of Δvi(t,t+1) is greater than the threshold TTi.
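As a minimal sketch (our Python illustration, not the authors' Excel tool), the threshold with tolerance can be assembled from the quantities GMi and Ti of Section 2.1, which are taken here as inputs:

```python
import math

def threshold_with_tolerance(gm_i, t_i, c=8):
    """TT_i for unit i, from its size measure GM_i and threshold T_i (Section 2.1)."""
    lgm_i = math.log(gm_i)        # LGM_i: compressed measure of the unit's size
    vi_i = math.ceil(lgm_i + c)   # VI_i = ceil(LGM_i + c); c = 8 in the experiments below
    gmt_i = gm_i + t_i            # GMT_i: threshold deformed by the unit's size
    return max(vi_i, gmt_i)       # TT_i

def anomalous_jump_flags(ds, tt_i):
    """1-based periods t such that |Delta_v_i(t,t+1)| exceeds TT_i."""
    return [t for t, x in enumerate(ds, start=1) if abs(x) > tt_i]
```

Plugging in the intermediate values reported in Example 2 below (LGM = 4.32, so VI = ⌈12.32⌉ = 13, and GMT = 177.15) reproduces TT = max(13, 177.15) = 177.15, and the deltas −180 and 190 of unit 3 are flagged.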
Example 2. We provide an example of an anomalous jump problem. Consider a unit (called unit 3) with variable v of a longitudinal dataset with n = 5 time instants. The data and the deltas of the unit are shown in Table 2. We compute the threshold T = 101.66, as already seen in the previous example. Then, we find LGM = 4.32, VI = 13 and GMT = 177.15. Considering c = 8, the resulting threshold with tolerance is TT = max(13, 177.15) = 177.15.
As | Δv3 (2,3) |= 180 > 177.15 and |Δv3 (3,4)|= 190 > 177.15, we report an anomalous jump flag for the period t = 2, 3 and the period t = 3, 4. The data manager will have to check the values of t = 2, t = 3 and t = 4 to understand the reasons for this anomalous jump.

2.3. Recalculation Problem

To identify a recalculation operated by the data provider, we use the above threshold with tolerance TTi. We suspect a recalculation problem on unit i if two contiguous deltas of opposite signs are both above the threshold TTi in modulus, as follows:
$$[(\Delta v_i(t-1,t) > 0 \wedge \Delta v_i(t,t+1) < 0) \vee (\Delta v_i(t-1,t) < 0 \wedge \Delta v_i(t,t+1) > 0)] \wedge (|\Delta v_i(t-1,t)| > TT_i \wedge |\Delta v_i(t,t+1)| > TT_i)$$
If this condition is true, a possible recalculation flag is raised.
Example 3. We provide an example of a recalculation problem. Consider a unit (called unit 4) with variable v of a longitudinal dataset with n = 5 time instants. The data and the deltas of the unit are shown in Table 3. Following the steps described above, after computing the threshold T = 39.40, we find LGM = 3.80, VI = 12 and GMT = 84.24. The resulting TT value for the unit is 84.24. A flag of possible recalculation error is raised for period t = 3 since Δv4(2,3) > 0 and Δv4(3,4) < 0, while simultaneously |Δv4(2,3)| = 87 > 84.24 and |Δv4(3,4)| = 155 > 84.24.
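As an illustration (again a Python sketch under our naming, not the Excel implementation), the recalculation condition reduces to a scan of contiguous delta pairs; applied to unit 4 with the TT value of Example 3, it flags only t = 3.

```python
def recalculation_flags(ds, tt_i):
    """1-based time instants t whose two contiguous deltas have opposite signs
    and are both above the threshold with tolerance TT_i in modulus."""
    flags = []
    for k in range(1, len(ds)):            # ds[k-1] = Delta(t-1,t), ds[k] = Delta(t,t+1)
        opposite = (ds[k - 1] > 0 > ds[k]) or (ds[k - 1] < 0 < ds[k])
        large = abs(ds[k - 1]) > tt_i and abs(ds[k]) > tt_i
        if opposite and large:
            flags.append(k + 1)            # the paper's 1-based time index t
    return flags

# Unit 4 of Example 3 (Table 3): deltas -4, 87, -155, 75 and TT = 84.24.
print(recalculation_flags([-4, 87, -155, 75], 84.24))   # [3]
```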
All the described operations are available in the Microsoft Excel file contained in [29]. This file can be used to perform the described checks on any data, by simply pasting them into the sheet “Main Table”. Each row must represent a single unit of analysis. The Excel file can also accommodate units with a variable number of time instants; the minimum number of time instants must be inserted in cell MIN OSS in the sheet “Threshold Calculation”.

2.4. Data

The European Tertiary Education Register (ETER) [30] is a key initiative for understanding the higher education landscape in Europe developed after the successful AQUAMETH project [31]. This database provides a reference list of Higher Education Institutions (HEIs) and institutional data on their activities and achievements, including students, graduates, staff and finances. It thus complements national and regional education statistics provided by EUROSTAT [32].
As of March 2024, ETER includes 41 European countries and provides data from 2011 to 2020, with a total of over 3500 HEIs. ETER collects a wide range of data on HEIs, including institutional characteristics (type, size, specialization), student information (enrolment, graduates, mobility), staff (lecturers, researchers, administrative staff), finances (income, expenditure, investment) and research and development activities. ETER complies extensively with statistical regulations and manuals, in particular the UOE Manual on Data Collection on Formal Education and the OECD Frascati Manual on Research and Experimental Development Statistics. This ensures the comparability of data with other international sources. Collaboration with a network of experts and data providers in all participating countries ensures that information is collected from reliable and consistent sources. Established methodologies are used to define variables and indicators, enabling the re-use of collected data for statistical purposes and comparability with other sources. Data undergo rigorous quality control and validation to identify and correct errors or inconsistencies, as described in [33]. However, as described in Section 3, the proposed techniques were able to locate several cases of the specific longitudinal data problems described above.
ETER contributes to a better understanding of the higher education landscape and is a valuable resource for researchers, policymakers and stakeholders in European higher education. Within ETER, we selected the case of the two variables “Total academic personnel” in headcount (HC) and “Total number of enrolled students” because they are widely used in empirical analysis and by policymakers as a proxy for the size of the universities. Therefore, they are two of the most important variables, and it is of paramount importance to detect any possible errors in them.
Total academic personnel in HC include the following:
(i)
The number of academic staff whose primary assignment is instruction, research or public service;
(ii)
Staff who hold an academic rank, like professor, assistant professor, lecturer or an equivalent title;
(iii)
Staff with other titles (like dean, head of department, etc.) if their principal activity is instruction or research;
(iv)
PhD students employed for teaching assistance or research.
Data on students are divided by the level of education of the program in which they are enrolled, using the International Standard Classification of Education (ISCED) in its 2011 version. This version includes the distinction between “Bologna” levels of education (Bachelor, Master and Doctorate). The “Total number of enrolled students” includes students from ISCED 5 (short-cycle tertiary), ISCED 6 (Bachelor) and ISCED 7 (Master), and does not include ISCED 8 (Doctoral level).
We report our experiments on the largest EU countries present in the ETER, i.e., Germany, France, Italy, Spain, Poland and Portugal, for a total of 1587 HEIs, in the time period from 2011 to 2020. Table 4 and Table 5 report the number of HEIs having complete data, respectively, for the variables “Total academic personnel” and “Total number of enrolled students”. The time interval of these two variables is annual.

3. Results

3.1. Experiments with the Proposed Techniques

All the computations described in Section 2 have been implemented in Microsoft Excel and run directly from a spreadsheet. Those controls have been applied to the described ETER database, considering the case of the variables Total academic personnel and Total number of enrolled students. All HEIs from Germany, France, Italy, Spain, Poland and Portugal with available values for that variable were considered, for a total of 1587 HEIs. In the computation of DVi and GMi, if some factor was zero, it was replaced with 1 to avoid them collapsing to zero. In the computation of VIi, the constant c was set at 8 by means of experimental fine-tuning. Table 6 and Table 7 (respectively, for the two variables Total academic personnel and Total number of enrolled students) report, for each country, the total number of flags raised by the described techniques. In particular, we indicate the H1 and H2 flags separately and then, when both are true, the number of inversion flags. The values in the brackets show the ratio between the number of flags and the sum of all universities with available data for the considered variable in the period 2011–2020 (i.e., the column Total in Table 4 and Table 5). After that, Table 8, Table 9 and Table 10 for the variable Total academic personnel and Table 11, Table 12 and Table 13 for the variable total number of enrolled students report the years over which the error flags were raised.
As may be observed, the procedures were able to detect the described problems in every country, notwithstanding the great care taken in obtaining correct data from the different data providers. The values are often higher for Germany mainly because this country has a much larger number of HEIs. If we consider the same values divided by the number of HEIs in the country, we obtain a much more uniform distribution of the errors.
The results show a strong presence of recalculations in the dataset. This type of problem is strongly conditioned by ETER's data collection method, which recomputes the values every year and may change some of its definitions from year to year. Furthermore, one piece of information that unfortunately cannot be evaluated by only looking at the ETER concerns the various reforms of contractual forms that have taken place over the years in the different countries and the role conventions within the institutions (for example, in some countries like Italy, teaching assistants have been phased out as a contractual form).
The running times required by the proposed procedures, implemented in Microsoft Excel, are below a few seconds for each whole country on a standard PC with an i7 CPU and 16 GB of RAM running the Microsoft Windows 11 OS.

3.2. Comparison with Other Existing Methods

This section contains a comparison between the proposed approach and four other error detection techniques available in the literature:
(1)
Local Outlier Factor (LOF)
(2)
Z-score threshold (Z-Score)
(3)
Interquartile Range threshold (IQR)
(4)
Hampel identifier (HI)
Since the other techniques cannot detect inversion or recalculation problems, our comparison is necessarily limited to the anomalous jump problem and is performed on a sample of Italian HEIs.
The above methods mainly operate by identifying outliers, and we apply them to the deltas to detect “outlier jumps”, roughly corresponding to our concept of an anomalous jump.
LOF compares the Local Reachability Density (LRD) of a point to that of its neighbors. An LOF score of approximately 1 indicates that the LRD around the point is comparable to that of its neighbors, so the point is not an outlier. Points that have a substantially lower LRD than their neighbors are farther away from the others. They are considered outliers if they produce a score lying outside an interval ILOF. The minimum number of neighboring points considered was set to 3 to check each value with the two adjacent ones. The extremes of the ILOF interval are computed as the average of the distances between each pair of normalized deltas in a time series (simply the difference of the two normalized values) plus or minus the standard deviation of those distances. If a time series has missing values, this technique cannot work, and the data unit is not checked.
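For reference, a LOF check on the deltas of a single series can be sketched with scikit-learn (an assumption of ours for illustration: the library's default cutoff, contamination='auto', is used here instead of the ILOF interval described above, so the flags may differ from those of our comparison):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Deltas of one time series (here, unit 1 of Example 1), as a column vector.
deltas = np.array([107.0, -112.0, 10.0, -10.0]).reshape(-1, 1)

lof = LocalOutlierFactor(n_neighbors=3)       # 3 neighbors, as in the comparison
labels = lof.fit_predict(deltas)              # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_        # LOF scores (close to 1 for inliers)
print(labels, scores.round(2))
```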
Z-scores quantify how far from the mean an observation is when data follow the normal distribution. The Z-score of a value is the number of standard deviations by which it lies above or below the mean. To calculate a Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically, Z = (X − μ)/σ, where X is an observation, μ is the mean of the population and σ is the standard deviation of the population. The larger the Z-score, the more the value differs from the average, and values above a threshold are declared outliers. We use 2 as the threshold. The main limitation of this method regards the normal distribution assumption: if data are not normally distributed, this approach might not be accurate.
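A direct transcription of this rule in Python (our sketch, with the threshold of 2 used in the comparison):

```python
import numpy as np

def zscore_outliers(ds, threshold=2.0):
    """Flag deltas lying more than `threshold` standard deviations from the mean."""
    ds = np.asarray(ds, dtype=float)
    z = (ds - ds.mean()) / ds.std()    # population standard deviation (ddof = 0)
    return np.abs(z) > threshold
```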
The Interquartile Range threshold computes the interquartile range Q3 − Q1, i.e., the difference between the third and first quartiles, and an interval IIQR whose extremes are given by Q1 − 1.5(Q3 − Q1) and Q3 + 1.5(Q3 − Q1). Deltas lying outside the interval IIQR are considered outliers.
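The corresponding sketch (ours) for the IQR rule:

```python
import numpy as np

def iqr_outliers(ds):
    """Flag deltas outside [Q1 - 1.5(Q3 - Q1), Q3 + 1.5(Q3 - Q1)]."""
    ds = np.asarray(ds, dtype=float)
    q1, q3 = np.percentile(ds, [25, 75])
    step = 1.5 * (q3 - q1)
    return (ds < q1 - step) | (ds > q3 + step)
```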
The Hampel Identifier [34] computes the median η of the deltas in a time series and their Median Absolute Deviation (MAD). Then, it computes an interval IHI whose extremes are η − 3MAD and η + 3MAD. Deltas lying outside the interval IHI are considered outliers.
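And a sketch of ours for the Hampel rule above:

```python
import numpy as np

def hampel_outliers(ds):
    """Flag deltas outside [median - 3*MAD, median + 3*MAD]."""
    ds = np.asarray(ds, dtype=float)
    eta = np.median(ds)
    mad = np.median(np.abs(ds - eta))   # Median Absolute Deviation
    return np.abs(ds - eta) > 3 * mad
```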
The following Figure 1 reports the results of the anomalous jump detection of our method, described in Section 2.2, and of the above-described four methods for the case of the “Total academic staff” from a sample of the first 26 Italian HEIs in the ETER dataset. The limited size of the sample is due to the fact that the real anomalous jumps, needed to evaluate the performance of each method, were not known in advance and had to be manually identified by experts in the field for the present comparison with time-consuming inspections.
As is observable, the accuracy, defined as the percentage of correct detections over all examined cases (sum of true positive and true negative cases), is the highest for our method (91%), followed by LOF (81%). Similar experiments on the whole Italian situation show that the proposed method almost always finds a larger number of cases, and in the manual controls, performed only on a subset of the alerted cases for obvious reasons of time, the proposed technique appears to have a very good discrimination power. Therefore, the overall outcome indicates that the proposed approach has a detection power for anomalous jumps that is at least comparable, if not superior, to that of each other single technique tested.
To offer another insight, in Figure 2, we focus on a single case, which is the series of the values for the variable total academic staff of a randomly extracted real unit (unit IT0010). Here, the situation can be inspected and judged by a human, and it appears that the steepest jumps are Δ2016–2017, Δ2018–2019 and Δ2019–2020, and they have been judged anomalous by experts in the field.
The results of each method are reported in the same figure, by coloring in azure the unalerted cases and, respectively, in yellow, red, green, purple and orange the deltas detected as anomalous jumps. As is observable, our approach is the only one correctly recognizing all three cases.

4. Discussion

The issues addressed in this work arise from the analysis of large numerical longitudinal databases. This type of data is becoming more and more accessible, and it is now used for many important analyses. Unfortunately, it may contain errors, like almost any other type of data. In addition to the generic errors commonly found in other types of data, longitudinal datasets often harbor subtle problems that generic techniques fail to trace. Those elusive types of errors should be automatically identified, so that they can be inspected and possibly removed by a human inspector. This work responds to this need by identifying three types of those errors and by proposing a statistical-mathematical approach to capture their complexity that can be applied after other generic techniques. In particular, we have identified the following types of errors:
(1)
The inversion problem, that is, the swapping of one or more values between two neighboring units;
(2)
The anomalous jump, that is, the presence of a jump between two consecutive values in a time series with the size being out of the ordinary;
(3)
The recalculation problem, which happens when the data provider discovers an error (typically a timing attribution error) on an already published value and operates a recalculation on the next value to compensate for the previous errors.
This list could even be extended in future studies. We devised techniques to identify potential errors of the described types that can be applied after a generic error correction step. We wanted these techniques to possess the following features: to be computationally viable even for large datasets; to work at the formal level, regardless of the meaning of the data, so as to be usable in several contexts; and to be flexible enough to adapt to different situations. The proposed techniques are based on a system of indicators and have been implemented in a Microsoft Excel spreadsheet, publicly available in [29] from the Mendeley Data repository, to favor the transparency and replicability of our experiments and to provide an easily accessible tool for anybody interested in using the proposed techniques on other datasets. We applied these techniques to an important example of a large longitudinal database, the ETER database, which is gathered from the different European countries and obtained through several processing steps. In this case, notwithstanding the great care spent in improving the quality of the data, several cases of the described problems were found by the proposed techniques. Thus, thanks to the described approach, the data quality of the dataset could be further improved.

5. Conclusions

When dealing with large numerical longitudinal databases, there exist errors specific to this type of dataset that are hardly identifiable or not identifiable at all by standard error detection and correction techniques. This work identifies some of these problems and proposes a statistical-mathematical approach based on a system of indicators that is able to capture the complexity of the described problems by working at the formal level, regardless of the specific meaning of the data. In particular, the types of errors analyzed in this work are as follows: (i) the inversion of one or more values from one unit to another; (ii) anomalous jumps in the series of values; (iii) errors in the temporal attribution of the values due to a recalculation operated by the data providers to compensate previous errors. The techniques to detect such errors were implemented in MS Excel and applied to the important database of European Higher Education institutions (ETER) by analyzing two key variables, namely, the total academic staff and the number of enrolled students. Note that these variables are two of the most important and delicate ones, and special care should be devoted to their correction. Each of them is often used in empirical analysis as a proxy for the size of the institutions, and they are also two of the main variables considered by policymakers at the European and national levels.
Empirical results show the effectiveness of the proposed techniques and the computational viability of the approach. Comparison with other existing techniques, which is possible only for the anomalous jump problem, reveals a superior detection power of our approach, whose accuracy on a sample reaches 91%. The implementation of the approach in Microsoft Excel makes it easy to use for researchers and functionaries working with large longitudinal databases. Moreover, it ensures the replicability of the approach and its applicability in other contexts.
Future work includes the identification of further cases of longitudinal data-specific errors and the development of techniques for their localization. With regard to the ETER dataset, we plan to extend the described techniques to other variables. This work has considerable applications and extensions. For example, it could be suggested that the national data collectors of ETER (typically the National Statistical Offices) use our approach to perform checks already at the national level and correct the data before sending them to ETER, to maximize the accuracy of the data provided. The Excel tool provided with this work allows an easy application of our method to the variables of interest before providing the data.

Author Contributions

Conceptualization, R.B. and C.D.; methodology, R.B.; software, S.D.L.; writing R.B., C.D. and S.D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Sapienza research grants RM120172B870E2E2 and RM12117A8A5DBD18.

Data Availability Statement

The European Tertiary Education Register (ETER) is available from the ETER project website: https://www.eter-project.com/#/home (accessed on 10 February 2024). The Microsoft Excel file implementing the proposed techniques is available from: Bruni, R.; Daraio, C.; Di Leo, S. (2024), “A detection tool for longitudinal data specific errors applied to the case of European universities”, Mendeley Data, V1, doi: 10.17632/syyc7t4z54.

Acknowledgments

We thank Benedetto Lepori and Daniel Wagner-Schuster for useful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. OECD. Quality Framework and Guidelines for OECD Statistical Activities; OECD Publishing: Paris, France, 2011. [Google Scholar]
  2. Daraio, C.; Iazzolino, G.; Laise, D.; Coniglio, I.M.; Di Leo, S. Meta-choices in ranking knowledge-based organizations. Manag. Decis. 2021, 60, 995–1016. [Google Scholar] [CrossRef]
  3. Ballou, D.P.; Pazer, H.L. Modeling Data and Process Quality in Multi-Input, Multi-Output Information Systems. Manag. Sci. 1985, 31, 150–162. [Google Scholar] [CrossRef]
  4. Pipino, L.L.; Lee, Y.W.; Wang, R.Y. Data quality assessment. Commun. ACM 2002, 45, 211–218. [Google Scholar] [CrossRef]
  5. Wang, R.Y.; Strong, D.M. Beyond Accuracy: What Data Quality Means to Data Consumers. J. Manag. Inf. Syst. 1996, 12, 5–33. [Google Scholar] [CrossRef]
  6. Wang, R.Y.; Ziad, M.; Lee, Y.W. Data Quality; Springer Science & Business Media: Berlin, Germany, 2006; Volume 23. [Google Scholar]
  7. Sadiq, S. (Ed.) Handbook of Data Quality: Research and Practice; Springer Science & Business Media: Berlin, Germany, 2013. [Google Scholar]
  8. Carlo, B.; Daniele, B.; Federico, C.; Simone, G. A Data Quality Methodology for Heterogeneous Data. Int. J. Database Manag. Syst. 2011, 3, 60–79. [Google Scholar] [CrossRef]
  9. Batini, C.; Scannapieco, M. Data and Information Quality; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
  10. Corrales, D.C.; Corrales, J.C.; Ledezma, A. How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning. Symmetry 2018, 10, 99. [Google Scholar] [CrossRef]
  11. Corrales, D.C.; Ledezma, A.; Corrales, J.C. From Theory to Practice: A Data Quality Framework for Classification Tasks. Symmetry 2018, 10, 248. [Google Scholar] [CrossRef]
  12. Liu, C.; Peng, G.; Kong, Y.; Li, S.; Chen, S. Data Quality Affecting Big Data Analytics in Smart Factories: Research Themes, Issues and Methods. Symmetry 2021, 13, 1440. [Google Scholar] [CrossRef]
  13. Bruni, R. Error correction for massive datasets. Optim. Methods Softw. 2005, 20, 297–316. [Google Scholar] [CrossRef]
  14. Bruni, R.; Daraio, C.; Aureli, D. Imputation techniques for the reconstruction of missing interconnected data from higher Educational Institutions. Knowl.-Based Syst. 2020, 212, 106512. [Google Scholar] [CrossRef]
  15. Alwin, D. The Margins of Error: A Study of Reliability in Survey Measurement; Wiley-Blackwell: Hoboken, NJ, USA, 2007. [Google Scholar]
  16. Saris, W.E.; Gallhofer, I.N. Design, Evaluation, and Analysis of Questionnaires for Survey Research; Wiley: Hoboken, NJ, USA, 2007; ISBN 9780470114957. [Google Scholar]
  17. Cernat, A.; Oberski, D. Estimating Measurement Error in Longitudinal Data Using the Longitudinal MultiTrait MultiError Approach. Struct. Equ. Model. A Multidiscip. J. 2022, 30, 592–603. [Google Scholar] [CrossRef]
  18. Wang, X.; Wang, C. Time Series Data Cleaning: A Survey. IEEE Access 2019, 8, 1866–1881. [Google Scholar] [CrossRef]
  19. Hawkins, D.M. Identification of Outliers; Chapman and Hall: London, UK, 1980; Volume 11. [Google Scholar]
  20. Blázquez-García, A.; Conde, A.; Mori, U.; Lozano, J.A. A Review on Outlier/Anomaly Detection in Time Series Data. ACM Comput. Surv. 2021, 54, 1–33. [Google Scholar] [CrossRef]
  21. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 93–104. [Google Scholar]
  22. Yang, J.; Rahardja, S.; Fränti, P. Outlier detection: How to threshold outlier scores? In Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing, Sanya, China, 19–21 December 2019; pp. 1–6. [Google Scholar]
  23. Grubbs, F.E. Sample Criteria for Testing Outlying Observations. Ann. Math. Stat. 1950, 21, 27–58. [Google Scholar] [CrossRef]
  24. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 413–422. [Google Scholar]
  25. Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods; Springer Science & Business Media: Berlin, Germany, 1991. [Google Scholar]
  26. Oberski, D.L.; Kirchner, A.; Eckman, S.; Kreuter, F. Evaluating the Quality of Survey and Administrative Data with Generalized Multitrait-Multimethod Models. J. Am. Stat. Assoc. 2017, 112, 1477–1489. [Google Scholar] [CrossRef]
  27. Pavlopoulos, D.; Pankowska, P.; Bakker, B.; Oberski, D. Modelling error dependence in categorical longitudinal data. In Measurement Error in Longitudinal Data; Oxford University Press: Oxford, UK, 2021. [Google Scholar] [CrossRef]
  28. Batini, C.; Cappiello, C.; Francalanci, C.; Maurino, A. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 2009, 41, 1–52. [Google Scholar] [CrossRef]
  29. Bruni, R.; Daraio, C.; Di Leo, S. A Detection Tool for Longitudinal Data Specific Errors Applied to the Case of European Universities. Mendeley Data, V1. 2024. Available online: https://data.mendeley.com/datasets/syyc7t4z54/1 (accessed on 23 February 2024).
  30. ETER Project Website. Available online: https://www.eter-project.com/#/home (accessed on 23 February 2024).
  31. Daraio, C.; Bonaccorsi, A.; Geuna, A.; Lepori, B.; Bach, L.; Bogetoft, P.; Cardoso, M.F.; Castro-Martinez, E.; Crespi, G.; de Lucio, I.F.; et al. The European university landscape: A micro characterization based on evidence from the Aquameth project. Res. Policy 2011, 40, 148–164. [Google Scholar] [CrossRef]
  32. Lepori, B.; Bonaccorsi, A.; Daraio, A.; Daraio, C.; Gunnes, H.; Hovdhaugen, E.; Ploder, M.; Scannapieco, M.; Wagner-Schuster, D. Establishing a European Tertiary Education Register; Publications Office of the European Union: Luxembourg, 2016; ISBN 978-92-79-52368-7. [Google Scholar] [CrossRef]
  33. Daraio, C.; Bruni, R.; Catalano, G.; Daraio, A.; Matteucci, G.; Scannapieco, M.; Wagner-Schuster, D.; Lepori, B. A Tailor-made Data Quality Approach for Higher Educational Data. J. Data Inf. Sci. 2020, 5, 129–160. [Google Scholar] [CrossRef]
  34. Hampel, F.R. The influence curve and its role in robust estimation. J. Am. Stat. Assoc. 1974, 69, 383–393. [Google Scholar] [CrossRef]
Figure 1. Percentages of anomalous jumps detected by each method.
Figure 2. Anomalous jumps detected by each method on a specific case.
Table 1. Values v and Δ of the inversion problem example.

         v1    v2    v3    v4    v5    Δ(1,2)   Δ(2,3)   Δ(3,4)   Δ(4,5)
Unit 1   125   18    130   120   130   107      −112     10       −10
Unit 2   21    150   30    25    20    −129     120      5        5
Table 2. Values and Δ of the unit considered for the anomalous jump example.

         v1    v2    v3    v4    v5    Δ(1,2)   Δ(2,3)   Δ(3,4)   Δ(4,5)
Unit 3   200   220   400   210   230   −20      −180     190      −20
Table 3. Values and Δ of the unit considered for the recalculation example.

         v1    v2    v3    v4    v5    Δ(1,2)   Δ(2,3)   Δ(3,4)   Δ(4,5)
Unit 4   163   167   80    235   160   −4       87       −155     75
Table 4. Number of HEIs with the variable total academic staff (HC) available in the ETER for each country and year in the period 2011–2020.

           2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   Total
Italy      115    115    115    114    114    114    114    114    114    114    1143
Germany    365    378    383    385    385    383    400    400    396    399    3874
Spain      77     80     80     81     81     80     82     83     83     84     811
France     131    132    130    129    126    0      123    123    119    111    1124
Poland     0      0      0      0      0      0      247    243    241    237    968
Portugal   113    106    94     91     90     95     90     90     89     92     950
Table 5. Number of HEIs with the variable total number of enrolled students available in the ETER for each country and year in the period 2011–2020.

           2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   Total
Italy      176    176    176    215    215    216    206    207    207    208    2002
Germany    368    380    384    388    389    391    402    401    401    399    3903
Spain      77     80     80     81     82     82     82     83     83     84     814
France     339    340    368    375    377    0      212    209    206    203    2629
Poland     286    272    280    282    281    274    248    245    243    239    2650
Portugal   113    106    94     91     90     96     90     90     89     92     951
Table 6. Total number of flags raised for variable total academic staff by countries.

           # of H1 Flags   # of H2 Flags   # of Inversion Flags   # of Jump Flags   # of Recalculation Flags
Italy      159 (0.14)      287 (0.25)      40 (0.03)              396 (0.35)        58 (0.05)
Germany    314 (0.08)      398 (0.10)      34 (0.01)              1059 (0.27)       32 (0.01)
Spain      24 (0.03)       81 (0.10)       4 (0.005)              249 (0.31)        21 (0.03)
France     18 (0.02)       20 (0.02)       1 (0.00)               160 (0.14)        5 (0.004)
Poland     79 (0.08)       71 (0.07)       12 (0.01)              9 (0.01)          18 (0.02)
Portugal   50 (0.05)       131 (0.14)      7 (0.01)               236 (0.25)        32 (0.03)
Table 7. Total number of flags raised for variable total enrolled students by countries.

           # of H1 Flags   # of H2 Flags   # of Inversion Flags   # of Jump Flags   # of Recalculation Flags
Italy      151 (0.08)      192 (0.1)       18 (0.01)              728 (0.36)        111 (0.06)
Germany    150 (0.04)      553 (0.14)      24 (0.01)              1452 (0.37)       99 (0.03)
Spain      35 (0.04)       86 (0.11)       5 (0.01)               289 (0.36)        17 (0.02)
France     96 (0.04)       638 (0.24)      27 (0.01)              795 (0.3)         193 (0.07)
Poland     85 (0.03)       428 (0.16)      12 (0)                 871 (0.33)        41 (0.02)
Portugal   23 (0.02)       166 (0.17)      5 (0.01)               256 (0.27)        13 (0.01)
Table 8. Number of inversion flags raised by country and by year (2011–2020).

           2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   Total
Italy      5      1      0      0      2      3      4      4      5      16     40
Germany    9      1      1      0      2      2      1      3      2      13     34
Spain      4      0      0      0      0      0      0      0      0      0      4
France     0      0      0      0      0      0      0      0      0      1      1
Poland     0      0      0      0      0      0      0      9      2      1      12
Portugal   2      0      0      0      2      0      0      2      0      1      7
Table 9. Number of anomalous jump flags raised by country and by delta.

           Δ2011–2012   Δ2012–2013   Δ2013–2014   Δ2014–2015   Δ2015–2016   Δ2016–2017   Δ2017–2018   Δ2018–2019   Δ2019–2020   Total
Italy      42           52           53           42           39           40           38           42           48           396
Germany    132          144          106          111          115          110          112          101          128          1059
Spain      26           21           15           58           31           30           19           26           23           249
France     102          5            6            16           0            0            12           10           9            160
Poland     1            1            1            1            1            1            1            1            1            9
Portugal   31           28           25           30           34           17           20           32           19           236
Table 10. Number of recalculation flags raised by country and by year (2011–2020).

           2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   Total
Italy      N.A.   5      9      13     6      5      5      9      6      N.A.   58
Germany    N.A.   0      2      0      6      8      3      6      7      N.A.   32
Spain      N.A.   3      0      3      4      4      2      2      3      N.A.   21
France     N.A.   2      1      0      0      0      0      1      1      N.A.   5
Poland     N.A.   0      0      0      0      0      0      18     0      N.A.   18
Portugal   N.A.   0      2      0      6      8      3      6      7      N.A.   32
Table 11. Number of inversion flags raised by country and by year (2011–2020).

           2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   Total
Italy      6      1      1      0      0      0      0      0      2      8      18
Germany    16     1      0      2      0      3      1      1      0      0      24
Spain      4      0      0      0      0      0      0      0      0      1      5
France     16     2      4      0      0      0      0      0      0      5      27
Poland     6      0      0      2      1      0      0      0      0      3      12
Portugal   4      0      0      0      1      0      0      0      0      0      5
Table 12. Number of anomalous jump flags raised by country and by delta.

           Δ2011–2012   Δ2012–2013   Δ2013–2014   Δ2014–2015   Δ2015–2016   Δ2016–2017   Δ2017–2018   Δ2018–2019   Total
Italy      83           60           67           86           116          92           75           72           728
Germany    208          190          178          145          142          151          133          139          1452
Spain      32           39           49           39           40           21           18           28           289
France     94           77           75           302          0            0            118          107          795
Poland     124          139          106          83           88           158          54           50           871
Portugal   56           35           18           22           17           17           27           28           256
Table 13. Number of recalculation flags raised by country and by year (2011–2020).

           2011   2012   2013   2014   2015   2016   2017   2018   2019   2020   Total
Italy      N.A.   10     9      6      15     33     19     7      12     N.A.   111
Germany    N.A.   13     11     12     10     17     8      13     15     N.A.   99
Spain      N.A.   4      3      5      1      0      2      1      1      N.A.   17
France     N.A.   19     17     49     0      0      0      103    5      N.A.   193
Poland     N.A.   5      7      7      3      13     4      0      2      N.A.   41
Portugal   N.A.   6      1      0      3      0      1      0      2      N.A.   13