Weighted Least Squares Regression with the Best Robustness and High Computability

Zuo, Yijun; Zuo, Hanwen

doi:10.3390/axioms13050295

Open AccessArticle

Weighted Least Squares Regression with the Best Robustness and High Computability

by

Yijun Zuo

^1,*

and

Hanwen Zuo

²

¹

Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA

²

Department of Computer Science, Michigan State University, East Lansing, MI 48824, USA

^*

Author to whom correspondence should be addressed.

Axioms 2024, 13(5), 295; https://doi.org/10.3390/axioms13050295

Submission received: 21 March 2024 / Revised: 15 April 2024 / Accepted: 21 April 2024 / Published: 27 April 2024

(This article belongs to the Special Issue New Perspectives in Mathematical Statistics)

Download

Browse Figures

Versions Notes

Abstract

:

A novel regression method is introduced and studied. The procedure weights squared residuals based on their magnitude. Unlike the classic least squares which treats every squared residual as equally important, the new procedure exponentially down-weights squared residuals that lie far away from the cloud of all residuals and assigns a constant weight (one) to squared residuals that lie close to the center of the squared-residual cloud. The new procedure can keep a good balance between robustness and efficiency; it possesses the highest breakdown point robustness for any regression equivariant procedure, being much more robust than the classic least squares, yet much more efficient than the benchmark robust method, the least trimmed squares (LTS) of Rousseeuw. With a smooth weight function, the new procedure could be computed very fast by the first-order (first-derivative) method and the second-order (second-derivative) method. Assertions and other theoretical findings are verified in simulated and real data examples.

Keywords:

weighted least squares; robustness; efficiency; computability; finite sample breakdown point

MSC:

62G35; 62J05; 62G99; 62G05

1. Introduction

In the classical regression analysis, we assume that there is a relationship for a given data set

{{(x_{i}^{⊤}, y_{i})}^{⊤}, i \in {1, 2, \dots, n}}

:

y_{i} = (1, x_{i}^{⊤}) β_{0} + e_{i}, i \in {1, \dots, n}

(1)

where

y_{i} \in R^{1}

, ⊤ stands for the transpose,

β_{0} = {(β_{01}, \dots, β_{0 p})}^{⊤}

(the true unknown parameter) in

R^{p}

and

x_{i} = {(x_{i 1}, \dots, x_{i (p - 1)})}^{⊤}

in

R^{p - 1}

(

p \geq 2

),

e_{i} \in R^{1}

is called an error term (or random fluctuation/disturbances, which is usually assumed to have zero mean and variance

σ^{2}

in classic regression theory). That is,

β_{01}

is the intercept term of the model. We write

w_{i} = {(1, x_{i}^{'})}^{⊤}

, then one has

y_{i} = w_{i}^{⊤} β_{0} + e_{i}

, which is used interchangeably with (1).

One wants to estimate the

β_{0}

based on a given sample

z^{(n)} : = {{(x_{i}^{⊤}, y_{i})}^{⊤}, i \in {1, \dots, n}}

from the model

y = (1, x^{⊤}) β_{0} + e

. Call the difference between

y_{i}

and

w_{i}^{⊤} β

the i-th residual

r_{i} (β)

for a candidate coefficient vector

β

(which is often suppressed). That is,

r_{i} : = r_{i} (β) = y_{i} - w_{i}^{⊤} β .

(2)

To estimate

β_{0}

, the classic least squares (LS) minimizes the sum of squares of residuals,

{\hat{β}}_{l s} = arg min_{β \in R^{p}} \sum_{i = 1}^{n} r_{i}^{2} .

Alternatively, one can replace the square above by the absolute value to obtain the least absolute deviations estimator (i.e.,

L_{1}

estimator, in contrast to the

L_{2}

(LS) estimator).

The LS estimator is very popular in practice across a broader spectrum of disciplines due to its great computability and optimal properties when the error

e_{i}

s is i.i.d. and follows a normal

N (0, σ^{2})

distribution. It, however, can behave badly when the error distribution is slightly departed from the normal distribution, particularly when the errors are heavy-tailed or contain outliers.

Robust alternatives to the

{\hat{β}}_{l s}

have abounded in the literature for a long time. The most popular ones are M-estimators [1], the least median squares (LMS) and least trimmed squares (LTS) estimators [2], S-estimators [3], MM-estimators [4],

τ

-estimators [5], maximum depth estimators ([6,7]), and the recent least squares of trimmed residuals (LST) regression [8], among others. For more related discussions, see Sections 1.2 and 4.4 of [9], and Section 5.14 of [10].

Robust methods that have a high breakdown point are usually computationally intensive and with a non-differentiable objective function (e.g., LMS, LTS, and LST). In this article, we will introduce a smooth and differentiable objective function that greatly facilitates the computation of the underlying estimator. We introduce a new class of alternatives for robust regression, weighted least squares (WLS) estimators

{\hat{β}}_{w l s}

:

{\hat{β}}_{w l s} = arg min_{β \in R^{p}} \sum_{i}^{n} w_{i} r_{i}^{2} (β),

(3)

where

w_{i}

is the weight associated with

r_{i}

with a fundamental feature: it assigns equal weight to all

r_{i}^{2}

that are small (no greater than a cut-off value) and exponentially down-weights (penalizes) the large ones (when

r_{i}^{2}

s are greater than the cut-off value).

Weighted least squares estimation has been proposed and discussed in the literature, including the famous Huber’s M-estimators which, however, can have the lowest breakdown point if the derivative of the weight (or loss) function is non-decreasing; see [9] (p. 13) or [10,11]. For more discussions, see Section 1.2 of [9] or Section 5.11 of [10]. Previous weight functions in the literature are either constants (e.g., LS with 1, or LMS and LTS with 0 and 1 weight), rank-based weight, do not down-weight large residuals sufficiently, or non-differentiable. Among these weight-induced regression estimators, there are few that possess a high breakdown point (50%), a high efficiency, and a high computability, simultaneously.

On the other hand, there is much room for smooth weight functions. Successful examples in location setting have already appeared in the literature, e.g., [12]. This motivates us to extend those smooth weight functions to regression setting and to achieve a high breakdown point and high efficiency and high computablity simultaneously. We propose using a differentiable

w (r)

, which would assign weight one to

r_{i}

s that lies close to the center of the data (all

r_{i}

s) cloud. The other points that lie on the outskirts of the data (all

r_{i}

s) cloud could be viewed as outliers, so a lower positive weight (not zero) should be given. This would balance efficiency with robustness. The weighted procedure proposed in this article has never appeared before. The specially chosen

w_{i}

’s in (3) will recover the famous LMS and the LTS in [2], and LST in [8]. More discussions on w and

{\hat{β}}_{w l s}

are carried out in Section 2, where an ad hoc choice of the weight function with the above property in mind will be introduced.

The rest of this article is organized as follows. Section 2 introduces a class of differentiable weight functions and a class of weighted least squares estimators. Section 3 establishes the existence of

{\hat{β}}_{w l s}

and studies its properties including its finite sample breakdown robustness. Section 4 discusses the computation of

{\hat{β}}_{w l s}

. Section 5 presents some concrete examples, comparing the performance of

{\hat{β}}_{w l s}

with other leading estimators. Section 6 ends the article with some concluding remarks. Long proofs of the main results are deferred to in Appendix A.

2. A Class of Weighted Least Squares

2.1. A Class of Weight Functions

An ad hoc choice of the weight function with the property mentioned in Section 1 takes the form of

w (x) = 1 (| x | \leq c) + 1 (| x | > c) \frac{e^{- k {(1 - c / | x |)}^{2}} - e^{- k}}{1 - e^{- k}}, \forall c, k > 0,

(4)

where the tuning parameter

k > 1

is a positive number (say, between 1 and 10) controlling the steepness of the exponential decrease (see the left panel of Figure 1), where the larger the k, the steeper the curve (the key difference from the trimmed procedures where the weight becomes zero). Tuning parameter c is the point where the weight function will change from a constant one to being exponentially decreasing. c (>1) usually can be set to be a large positive number (say 10), or it can be residual dependent, say 50% or 75% percentile of the residuals, and a larger c is favorable for higher efficiency. c is assumed to be finite to exclude the LS case (i.e.,

w (x)

will not be a constant one).

One of the examples of

w (x)

is given in Figure 1, where

w (x)

and its derivative are given and

k = 5

and

c = 100

. For a general

w (x)

, it is straightforward to verify that

P1: $w (x)$ is twice differentiable and $0 < w (x) \leq 1$ . When $x \to \infty$ , $w (x)$ is asymptotically equivalent to $α (e^{γ x^{- 1}} - 1)$ for some positive constants $α$ and $γ$ .
P2: If $r_{i} \to \infty$ , then $w (r_{i}^{2} / c^{*}) r_{i}^{2} \to 2 c k c^{*} / (e^{k} - 1)$ , where $c^{*} : = {M e d}_{i} {y_{i}^{2}}$ , the median of ${y_{1}^{2}, y_{2}^{2}, \dots, y_{n}^{2}}$ .

2.2. Weighted Least Squares Estimators

With the weight function given above, we are ready to specify the weighted least squares estimator in (3) with more details:

{\hat{β}}_{w l s} = arg min_{β \in R^{p}} \sum_{i}^{n} w_{i} r_{i}^{2} (β),

(5)

where weight

w_{i} : = w (r_{i}^{2} / c^{*})

, with

w (x)

being a weight function in (4) that satisfies P2 and

c^{*}

defined in P2.

The behavior of function

w (r^{2} / c^{*}) r^{2}

when

r > c

for different

c^{*}

s is illustrated in Figure 2 below. Inspecting the figure reveals that it is strictly convex.

3. Properties of the ${\hat{β}}_{w l s}$

3.1. Existence

Does the minimizer of the objective function

O (β, z^{(n)}) : = \sum_{i}^{n} w_{i} r_{i}^{2} (β)

on the right-hand side (RHS) of (5) exist? We now formally address this. We need the following assumption.

A1: For a given sample

z^{(n)} : = {{(z_{i})}_{i = 1}^{n}} = {{(x_{i}^{⊤}, y_{i})}^{⊤}, i \in {1, 2, \dots, n}}

and any

β \in R^{p}

, all points

{{(x_{i}^{⊤}, y_{i})}^{⊤}}

with

r_{i}

s satisfying

r_{i}^{2} / c^{*} \leq c

do not lie in a vertical hyperplane.

The assumption holds true with probability one if the sample comes from a distribution of

{(x^{⊤}, y)}^{⊤}

that has a density. Now, we have the following existence result.

Theorem 1.

If A1 holds true, then the minimizer

{\hat{β}}_{w l s}

of

O (β, z^{(n)})

always exists.

Proof.

See the Appendix A. □

3.2. Equivariance

Desirable fundamental properties of regression estimators include regression, scale, and affine equivarince. For

x \in R^{n \times (p - 1)}

and

y \in R^{n}

, a regression estimator

\hat{β} : = t (w, y)

with

w = {(1, x^{⊤})}^{⊤}

satisfying

t (w, y + b^{⊤} w) = t (w, y) + b, \forall b \in R^{p};

(6)

t (w, s y) = s t (w, y), \forall s \in R;

(7)

t (A^{⊤} w, y) = A^{(- 1)} t (w, y), \forall nonsingular A \in R^{p \times p} .

(8)

is called regression, scale, and affine equivariant, respectively (see page 116 of [9]). All aforementioned regression estimators are regression, scale, and affine equivariant.

Theorem 2.

{\hat{β}}_{w l s}

defined in (3) is regression, scale, and affine equivariant.

Proof.

Notice the identities

r_{i} = y_{i} - w_{i}^{⊤} β = y_{i} + b^{⊤} w_{i} - w_{i}^{⊤} (β + b)

,

s r_{i} = s y_{i} - w_{i}^{⊤} (s β)

, and

r_{i} = y_{i} - {(A^{⊤} w_{i})}^{⊤} A^{- 1} β

. Meanwhile,

r_{i}^{2} / c^{*}

is regression, scale, and affine invariant. The desired result follows. □

3.3. Robustness

As an alternative to the least-squares

{\hat{β}}_{l s}

, is the

{\hat{β}}_{w l s}

more robust?

The most prevailing quantitative measure of the global robustness of any location or regression estimators in the finite sample practice is the finite sample breakdown point (FSBP), introduced in [13].

Roughly speaking, the FSBP is the minimum fraction of ‘bad’ (or contaminated) data that the estimator can be affected by to an arbitrarily large extent. For example, in the context of estimating the center of a data set, the sample mean has a breakdown point of

1 / n

(or 0%) because even one bad observation can change the mean by an arbitrary amount; in contrast, the sample median has a breakdown point of

⌊ (n + 1) / 2 ⌋ / n

(or 50), where

⌊ \cdot ⌋

is the floor function.

Definition 1

([13]). The finite sample replacement breakdown point (RBP) of a regression estimator

t

at the given sample

z^{(n)} = {z_{1}, \dots, z_{n}}

, where

z_{i} : = {(x_{i}^{⊤}, y_{i})}^{⊤}

, is defined as

R B P (t, z^{(n)}) = min_{1 \leq m \leq n} \{\frac{m}{n} : sup_{z_{m}^{(n)}} ∥ t (z_{m}^{(n)}) - t (z^{(n)}) ∥ = \infty\},

(9)

where

z_{m}^{(n)}

denotes an arbitrary contaminated sample by replacing m original sample points in

z^{(n)}

with arbitrary points in

R^{p}

. Namely, the RBP of an estimator is the minimum replacement fraction that could drive the estimator beyond any bound. It turns out that both

L_{1}

(least absolute deviations) and

L_{2}

(least squares) estimators have RBP

1 / n

(or 0%), the lowest possible value, whereas

{\hat{β}}_{w l s}

can have

(⌊ (n - p) / 2 ⌋ + 1) / n

(or 50%), the highest possible value for any regression equivariant estimators (see p. 125 of [9]).

We shall say

z^{(n)}

is in a general position when any p of observations in

z^{(n)}

gives a unique determination of

β

. In other words, any (p-1) dimensional subspace of the space

{(x^{⊤}, y)}^{⊤}

contains at most p observations of

z^{(n)}

. When the observations come from continuous distributions, the event (

z^{(n)}

being in the general position) happens with a probability of one.

Theorem 3.

Assume that A1 holds true,

n > p

, and

z^{(n)}

is in the general position. Then,

R B P ({\hat{β}}_{w l s}^{n}, z^{(n)}) = \{\begin{matrix} ⌊ (n + 1) / 2 ⌋ / n, & i f p = 1, \\ (⌊ (n - p) / 2 ⌋ + 1) / n, & i f p > 1 . \end{matrix}

(10)

Proof.

see the Appendix A. □

We need the following important result for the Proof of Theorem 3.

Lemma 1.

For any

r_{i}^{2} > r_{j}^{2} > c^{*} c

,

w (r_{i}^{2} / c^{*}) r_{i}^{2} < w (r_{j}^{2} / c^{*}) r_{j}^{2}

when

r_{j}^{2} \to \infty

.

Proof.

See the Appendix A. □

Remark 1.

The RBP result in Theorem 3 is the highest possible breakdown point for any regression equivariant estimators in the literature (see p. 125 of [9]). There are very few regression estimators that possess the highest breakdown point robustness.

4. Computation of the WLS

Now, we address the most important issue with a high breakdown point estimator, the computation of the estimator. The objective function in (5) is

O (β) : = O (β, z^{(n)}) = \sum_{i = 1}^{n} w (r_{i}^{2} / c^{*}) r_{i}^{2},

(11)

which is differentiable with respect to

β

since the weight function

w (x^{2} / c^{*})

is twice differentiable with

\begin{matrix} w^{'} (x) & = α^{*} e^{- k {(1 - c / | x |)}^{2}} (1 - c / | x |) sgn (x) / x^{2} 1 (| x | > c), \\ w^{″} (x) & = α^{*} e^{- k {(1 - c / | x |)}^{2}} (- 2 k c {(1 - c / | x |)}^{2} / | x | - (2 - 3 c / | x |)) / x^{3} 1 (| x | > c), \end{matrix}

(12)

where

α^{*} = - 2 k c / (1 - e^{- k})

. The problem in (3) belongs to an unconstrained minimization. This type of problem has been thoroughly discussed and studied in the literature. Common approaches to find the solution include (i) methods utilizing first-order derivatives (gradient descent/steepest descent/conjugate gradient), (ii) methods using second-order derivatives (Hessian matrix) (Newton’s method), and (iii) quasi-Newton method, see [14,15]. We will select the conjugate gradient for speed/efficiency and accuracy consideration.

Note that

\begin{matrix} \nabla O (β) = \frac{\partial O (β)}{\partial β} & = \sum_{i = 1}^{n} (w^{'} (r_{i}^{2} / c^{*}) r_{i}^{2} + c^{*} w (r_{i}^{2} / c^{*})) \frac{\partial r_{i}^{2} / c^{*}}{\partial β} \\ = \sum_{i = 1}^{n} (w^{'} (r_{i}^{2} / c^{*}) r_{i}^{2} + c^{*} w (r_{i}^{2} / c^{*})) 2 r_{i} / c^{*} (- w_{i}) \\ = \sum_{i = 1}^{n} - 2 r_{i} / c^{*} (w^{'} (r_{i}^{2} / c^{*}) r_{i}^{2} + c^{*} w (r_{i}^{2} / c^{*})) w_{i} . \end{matrix}

(13)

\begin{matrix} \nabla^{2} O (β) = \frac{\partial^{2} O (β)}{\partial^{2} β} & = \frac{- 2}{c^{*}} \sum_{i = 1}^{n} \frac{\partial (r_{i} (w^{'} (r_{i}^{2} / c^{*}) r_{i}^{2} + c^{*} w (r_{i}^{2} / c^{*}))}{\partial β} w_{i} \\ = \frac{- 2}{c^{*}} \sum_{i = 1}^{n} w_{i}^{⊤} w_{i} (5 r_{i}^{2} w^{'} (\frac{r_{i}^{2}}{c^{*}}) + c^{*} w (\frac{r_{i}^{2}}{c^{*}}) + 2 \frac{r_{i}^{4}}{c^{*}} w^{″} (\frac{r_{i}^{2}}{c^{*}})) \\ = X_{n}^{⊤} W X_{n}, \end{matrix}

(14)

where

X_{n}^{⊤} = (w_{1}^{⊤}, \dots, w_{n}^{⊤})

,

W

is a diagonal matrix with its i-th diagonal entry

- 2 γ_{i} / c^{*}

and

γ_{i} = 5 r_{i}^{2} w^{'} (\frac{r_{i}^{2}}{c^{*}}) + c^{*} w (\frac{r_{i}^{2}}{c^{*}}) + 2 \frac{r_{i}^{4}}{c^{*}} w^{″} (\frac{r_{i}^{2}}{c^{*}}) .

Write

γ_{i} / c^{*}

as

g (t_{i})

, then

g (t_{i}) = 5 t_{i} w^{'} (t_{i}) + 2 t_{i}^{2} w^{″} (t_{i}) + w (t_{i})

, where

t_{i} = r_{i}^{2} / c^{*} > c

and

g (t) < 0

for

t > c

for different

c > 0

as indicated below in Figure 3. Namely,

W

is positive definite when

t_{i} > c

.

The algorithm for the conjugate gradient method (CGM) is as follows:

(i)

Step 1. Pick a

β^{0}

(which can be an LS estimator, but for robustness, the LTS ([2]) or LST ([8]) is a better choice). Set

v^{0} = - \nabla O (β^{0})

. Set a tolerance

ε

. if

(∥ v^{0} ∥ < ε)

{return

β^{0}

}.

(ii)

Step 2. For

k = 0, 1, \dots, n - 1

,

(a): Set $β^{k + 1} = β^{k} + α^{k} v^{k}$ , where $α^{k}$ is the minimizer of $O (β^{k} + α v^{k})$ with respect to $α$ (using backtracking line search, see page 464 of [14]), or set

$α^{k} = - \nabla^{⊤} O (β^{k}) v^{k} / {(v^{k})}^{⊤} H (β^{k}) v^{k},$

where $H (β^{k}) = \nabla^{2} (O (β^{k}))$ .
(b): Compute $\nabla O (β^{k + 1})$ , if $(∥ \nabla O (β^{k + 1}) ∥ < ε)$ {return $β^{k + 1}$ }.
(c): If ( $k = n - 1$ ) {break}; else set $v^{k + 1} = - \nabla O (β^{k + 1}) + α^{k} v^{k}$ , where

$α^{k} = \nabla^{⊤} O (β^{k + 1}) \nabla O (β^{k + 1}) / \nabla^{⊤} O (β^{k}) \nabla O (β^{k})$

end for loop.

(iii)

Step 3. Replace

β^{0}

by

β^{n}

and go to step 1.

Convergence of the gradient algorithm or gradient descent method to the global minimum has been thoroughly analyzed on pp. 466–467 of Boyd and Vandenberghe (2004) [14]. The global convergence of conjugate gradient methods specifically has been addressed in Gilbert and Nocedal (1992) [16].

5. Examples and Comparison

Now, we investigate the performance of our new procedure WLS and compare it with some leading competitors including the robust benchmark, the least trimmed squares LTS estimator, Rouseeuw [2] (known for its high robustness and fast computation), the MM estimator of Yohai [4] (known for its high robustness and high efficiency), and the classical least squares LS estimator (known for its high efficiency for i.i.d. normal errors) via some concrete examples.

5.1. Performance Criteria

Empirical mean squared error (EMSE) For a general regression estimator $t$ , we calculate $EMSE : = \sum_{i = 1}^{R} {∥ t_{i} - β_{0} ∥}^{2} / R$ , the empirical mean squared error (EMSE) for $t$ . If $t$ is regression equivariant, then we can assume (w.l.o.g.) that the true parameter $β_{0} = 0 \in R^{p}$ (see p.124 of [9]). Here, $t_{i}$ is the realization of $t$ obtained from the i-th sample with size n and dimension p, and replication number R is usually set to be 1000.
Total time consumed for all replications in the simulation (TT) This criterion measures the speed of a procedure, where the faster and more accurate, the better.One possible issue is the fairness of comparison of different procedures because different programming languages (e.g., C, Rcpp, Fortran, and R) are employed by different procedures.
Finite sample relative efficiency (FSRE) In the following, we investigate via simulation studies the finite-sample relative efficiency of the different robust alternatives of the LS with respect to the benchmark, the classical least squares line/hyperplane. The latter is optimal with normal models by the Gauss–Markov theorem. We generate $R = 1000$ samples from the linear regression model: $y_{i} = β_{0} + β_{1} x_{1} + \dots + β_{p - 1} x_{p - 1} + e_{i}, i \in {1, \dots, n}$ with different sample size ns and dimension ps, where $e_{i} \sim N (0, σ^{2})$ . The finite sample RE of a procedure is the percentage of EMSE of the LS divided by the EMSE of the procedure.

All R code (downloadable via https://github.com/left-github-4-codes/WLS) accessed on 19 March 2024 for simulation, examples, and figures in this article were run on a desktop Intel(R)Core(TM) 21 i7-2600 CPU @ 3.40 GHz.

5.2. Examples

In the sequel, the cutoff value

ε

is set to be

10^{- 4}

for the procedure WLS. For simplicity, we set the tuning parameters

c = k = 6

for the weight function.

Example 1

(Simple linear regression). To take the advantage of the graphical illustration of data sets and plots, we start with

p = 2

, the simple linear regression case.

We generated a data set with seven artificial highly correlated (with correlation

0.88

between x and y) bi-variate normal points. It is plotted in the left panel of Figure 4. Two reference regression lines (

y = 0

) and (

y = x

) are also provided.

Inspecting the left panel of the figure immediately reveals that points 5 and 6 seem to be outliers and the overall pattern of the data set is linear

y = c x

with

c > 0

. The right panel further reveals that the LS, the LTS, and the MM lines are very sensitive to the outlying points, whereas WLS still can catch the overall line pattern under the influence of two outliers.

One might immediately argue that the example above has at least two drawbacks: (i) the data set is too small, and (ii) it is purely artificial. In Figure 5, the sample size is increased to 80 highly correlated normal points with

30 %

of them contaminated by other normal points. Inspecting the figure reveals that the four procedures capture the linear pattern perfectly in the left panel of the figure for perfect bivariate normal points, while in the right panel, the LTS, MM, and LS lines are drastically changed due to the 24 contaminating points, while WLS well resists the influence of outliers, catching the original overall linear pattern.

In practice, there are more cases with more than one independent variable: in the following, we consider the case

p > 2

.

Example 2

(Multiple linear regression with contaminated normal points). Now, we do not have the visual advantage like in the

p = 2

case. To compare the performance of different procedures, we have to appeal the performance measures discussed in Section 5.1.

We consider the contaminated highly correlated normal data points scheme. We generate 1000 samples

{z_{i} = {(x_{i}^{⊤}, y_{i})}^{⊤}, i \in {1, \dots, n}}

with various ns from the normal distribution

N (μ, Σ)

, where

μ

is a zero vector in

R^{p}

, and

Σ

is a p by p matrix with diagonal entries being 1 and off-diagonal entries being

0.9

. Then,

ε %

of them are contaminated by

m = ⌈ n ε ⌉

points, where

⌈ \cdot ⌉

is the ceiling function. We randomly select m points of

{z_{i}

,

i \in {1, \dots, n}}

and replace them by

{(3, 3, \dots, 3, - 3)}^{⊤}

.

The performance of the CGM in Section 4 (or rather any iterative procedure) severely depends on the initial given point

β^{0}

. In light of its cyclic feature of the CGM for non-quadratic objective function (see page 195 of [15]) and our extensive empirical simulation experience, the performance of the

β

return by the CGM usually is not very different from (or better than) that of the initially selected

β^{0}

. To achieve better performance for the WLS, we modified the LST of Zuo and Zuo [8] and utilized it as the initial

β^{0}

for CGM. Results for the three methods and different ns and ps and contamination levels

ε

are listed in Table 1.

Inspecting the table reveals that (i) LS is the fastest in all cases considered and the best performer for pure normal data sets, except the case

p = 20

and

n = 200

, where WLS is even slightly more efficient. It, however, becomes the worst performer when there is contamination (except the

ε = 0.30

cases, where the LTS and MM surprisingly become the worse performers. In theory, both MM and LTS can resist up to 50% contamination without breakdown). (ii) WLS has the smallest EMSE when there is contamination and this is true even with no contamination when

p = 20

and

n = 200

. It is also the second fastest performer (except in the case

ε = 0.3

and

p = 5

or 10, where MM is faster). (iii) LTS is inferior to WLS in all cases and so is the MM (except it runs faster when

ε = 0.3

and

p = 5

or 10). (iv) MM performs better than LTS in TT and in EMSE (except when

p = 20

and

ε = 0.0

, 0.10, or 0.20).

Example 3

(Performance when $β^{0}$ is given). In the calculation of EMSE above, one assumes that

β^{0} = 0

in light of regression equivariance of an estimator

t

. In this example, we will provide

β^{0}

(for convenience, write it as

β_{0}

) and calculate

y_{i}

using the formula

y_{i} = (1, x_{i}^{⊤}) β_{0}^{⊤} + e_{i}

, where we simulate

x_{i}

from a normal distribution with a zero mean vector and an identical covariance matrix.

e_{i}

follows a standard normal distribution.

We set

p = 10

,

n = 100

and

β_{0} = (1, 1, 1, 1, 1, - 1, - 1, - 1, - 1, - 1)

. There is a

ε = 10 %

contamination for each of 1000 normal samples (generated as in Example 2) with the contamination scheme as follows: we randomly select

m = ⌈ n ε ⌉

points out of

{z_{i}

,

i \in {1, \dots, n}}

and replace them by

{(4.5, 4.5, \dots, 4.5)}^{⊤}

. We then calculate the squared deviation (SD)

{({\hat{β}}_{i} - β_{0})}^{2}

for each sample, the total time (TT) consumed by each procedures for all 1000 samples, and the relative efficiency (RE) (the ratio of EMSE of LS vs. EMSE of a procedure). The performance of three procedures for different criteria are displayed via the boxplot in Figure 6.

Inspecting the figure reveals that (i) in terms of squared deviations, LTS and LS perform the same, where both have a wide spread and a high EMSE, whereas MM has a much smaller EMSE, and WLS has the smallest EMSE (in fact, the EMSE for the four (mm, wls, lts, ls) are (1.188640, 1.037962, 2.245551, 2.245551)). (ii) In terms of total time consumed, LS is the absolute winner, LTS is the absolute loser, and WLS is much better than LTS and slightly better than MM. (iii) In terms of relative efficiency, LTS is the loser (performs as bad as the LS), whereas WLS earns the trophy and is much more robust against 10% contamination. MM is the second best.

Up to this point, we have dealt with synthetic data sets. Next, we investigate the performance of MM, WLS, LTS, and LS with respect to real data sets in high dimension.

Example 4

(Performance for a large real data set). Boston housing is a famous data set (see [17]) and studied by many authors with different emphasizes (transformation, quantile, nonparametric regression, etc.) in the literature. For a more detailed description of the data set, see http://lib.stat.cmu.edu/datasets/ accessed on 19 March 2024.

The analysis reported here does not include any of the previous results but consists of just a straight linear regression of the dependent variable (median price of a house) on the thirteen explanatory variables as might be used in an initial exploratory analysis of a new data set. We have sample size

n = 506

and dimension

p = 14

.

We assess the performance of the MM, the LST, the WLS, and the LS as follows. Since some methods depend on randomness, we run the computation

R = 1000

times to alleviate the randomness. (i) We compute the

\hat{β}

with different methods, and we do this 1000 times. (ii) We calculate the total time consumed (in seconds) by different methods for all replications and the EMSE (with true

β_{0}

being replaced by the sample mean of 1000

\hat{β}

s from (i)), which is the sample variance of all

\hat{β}

s up to a factor

1000 / 999

. The results are reported in Table 2.

Inspecting the table reveals that (i) WLS and LS produce the same

\hat{β}

for each sample, so there is no variance, whereas this is not the case for MM and LTS. (ii) LS is the fastest runner followed by MM, LTS, and WLS. (iii) The relative efficiency of MM and LTS is 0% since the sample variance of LS is 0, whereas the RE of WLS and LS is undefined (not a number) since 0 appeared in the denominator. On the other hand, one can interpret WLS as being as good as LS in this case with RE 100%.

Example 5

(Performance for a real data set which is known to contain outliers). We examine the data set of Buxton (1920) [18], which has been studied repeatedly in the literature, see Hawkins and Olive (2002) [19], Olive (2017) [20], Park, Kim, and Kim (2012) [21], Olive and Hawkins (2011) [22].

We fit the different methods to the Buxton data, which is a 87 by 7 matrix (original row 9 was deleted), with height as the response variable and other four variables as predictor variables (two variables are excluded due the missing values) as Olive did. For more explanations, see Olive’s website at http://parker.ad.siu.edu/Olive/buxton.txt accessed on 19 March 2024.

We list in Table 3 the output of the methods (mm, lts, lms wls, ls, hbreg, and rmreg2), where the last two methods are proposed by Olive and Hawkins (2011) and Olive (2017) [20,22], respectively.

With great help from Dr. Olive, we were able to have the pairwise scatter plots of points of

({\hat{y}}_{i}, y_{i})

, namely, fitted values versus observed values and fitted values versus fitted values of different methods. The plot is given in Figure 7 (lms is omitted; it performs much the same as most other robust ones).

Inspecting Figure 7 reveals that there are five obvious outliers on response variable y. Further examining the data set confirms that observations 61:65 have unusual small response values from 18 to 19, while all others are in between 1500 and 1800 and have unusual, larger head length values. The first row of Figure 7 is

({\hat{y}}_{i}, y_{i})

for different methods. It is seen that five out of six methods perform much the same, while rmreg2 performs remarkably different.

The latter produces much larger fitted values for the five outliers which might be interpreted as the method resisting the influence of the outliers while others cope with the five outliers and produce fitted values that are in the same order in magnitude as the observed values, which might be interpreted as these methods being heavily influenced by the five outliers.

To better understand the performance of the six methods, we produced a classic fitted value versus the standardized residuals plot in Figure 8, which clearly identifies five outliers and performance difference between the six methods (rmreg2 performs remarkably different from all others).

Furthermore, to better appreciate the hyperplanes induced from

\hat{β}

in Table 3 and to take the two-dimensional graphic visualization advantage, we look at the two-dimensional vertical cross-section of hyperplanes in the fifth dimension (restricted/project to y versus

x_{3}

dimension) and plot the lines (intercept and head) based on Table 3 by different methods (they are different from the regression lines based on (

x_{3}

, y) by different methods) in Figure 9. From the Figure, we obtain a better understanding of the behavior of different methods. All seven lines but the one from rmreg2 have a negative slope.

Note that both hbreg and rereg2 functions output more than one solution. We chose hbreg$coef (which is identical to ls) and rmreg2$Bhat in this data set case. The lines from hbreg, wls, and lts are almost parallel, while lines from mm and lms are also almost parallel to the majority but far away from the data cloud and should be discarded in this case. Similar plots with other variables could also be constructed.

Lines in Figure 9 are induced from the hyperplanes in Table 3 by projection to the (head, height)-dimension in the five-dimensional space. One naturally wonders: are they the same as the lines from direct regression on (head, height) by different methods? To appreciate the difference between two types of lines, we fit (head, height) [as (x, y)] with different methods, and the lines are given in Figure 10. Inspecting the figure reveals that all the lines perform the same but the line induced from rmreg2.

6. Concluding Remarks

With a novel weighting scheme, the proposed weighted least squares estimator performs as efficiently as the classic least squares (LS) estimator for perfect normal data, while being more efficient than MM and much more efficient than the LTS estimator. It is much more robust than the LS when there is contamination or outliers (it is also more robust than MM and LTS when the contamination level is 30%). It performs as robustly as the LTS and the MM while being more efficient than MM and LTS when there are outliers. It possesses the best finite sample breakdown point robustness while achieving high efficiency and computability. It could serve as a robust alternative to the LTS and the MM in practice.

Author Contributions

Writing—original draft, Y.Z. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors thank Wei Shao for insightful comments and stimulus discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Main Results

Proof of Theorem 1.

For a given

z^{(n)}

and any

β

, write

M : = \sum_{i = 1}^{n} y_{i}^{2} \geq \sum_{i = 1}^{n} w (y_{i}^{2} / c^{*}) y_{i}^{2} = O (0_{p \times 1},

z^{(n)})

. For a given

β \in R^{p}

, hereafter assume that

H_{β}

is the hyperplane determined by

y = w^{⊤} β

, and let

H_{h}

be the horizontal hyperplane (i.e.,

y = 0

, the

w

-space).

Partition the space of

β

s into two parts,

S_{1}

and

S_{2}

, with

S_{1}

containing all

β

s such that

H_{β}

and

H_{h}

are parallel and

S_{2}

consists of the rest of

β

s so that

H_{β}

and

H_{h}

are not parallel.

If one can show that there are minimizers of

O (β, z^{(n)})

over

S_{i}

i = 1, 2

, respectively, then one can have an overall minimizer.

Over

S_{1}

,

β = {(β_{0}, 0_{(p - 1) \times 1}^{⊤})}^{⊤}

and

r_{i} = y_{i} - β_{0}

. If the minimizer does not exist, then it means that any bounded

β_{0}

cannot minimize

O (β, z^{(n)})

, and the absolute value of the minimizer

β_{0}^{*}

must be greater than any

M^{*} > 0

. We seek a contradiction now. Denote the minimizer as

β^{*} = {(β_{0}^{*}, 0_{(p - 1) \times 1}^{⊤})}^{⊤}

. Define

β_{1}^{*} = {(2 β_{0}^{*}, 0_{(p - 1) \times 1}^{⊤})}^{⊤}

, then it is readily seen that

r_{i}^{2} (β_{1}^{*}) > r_{i}^{2} (β^{*})

for large enough

β_{0}^{*}

. By Lemma 1 below, one has

O (β^{*}, z^{(n)}) > O (β_{1}^{*}, z^{(n)})

. A contradiction is obtained.

Over

S_{2}

, denote by

l_{β}

the intersection part of

H_{β}

with the horizontal hyperplane

H_{h}

(we call it a hyperline, though it is

p - 1

-dimensional). Let

θ_{β} \in (- π / 2, π / 2)

be the acute angle between the

H_{β}

and

H_{h}

(and

θ_{β} \neq 0

). Consider two cases.

Case I. All

w_{i} = {(1, x_{i}^{⊤})}^{⊤}

with

r_{i}^{2} / c^{*} \leq c

on the hyperline

l_{β}

, where

r_{i} = y_{i} - w_{i}^{⊤} β

. Then, we have a vertical hyperplane that is perpendicular to the horizontal hyperplane

H_{h}

(y = 0)

and intersect

H_{h}

at

l_{β}

, But this contradicts A1. We only need to consider the other case.

Case II. Otherwise, define

δ = \frac{1}{2} inf {τ, s u c h t h a t N (l_{β}, τ) c o n t a i n s a l l w_{i} w i t h r_{i}^{2} / c^{*} \leq c},

where

N (l_{β}, τ)

is the set of points in

w

-space such that each distance to the

l_{β}

is no greater than

τ

. Clearly,

0 < δ < \infty

(since

δ = 0

has been covered in Case I and

δ \leq {max}_{i} {∥ w_{i} ∥} < \infty

, where i satisfies

r_{i}^{2} / c^{*} \leq c

, and the first inequality follows from the fact that the hypotenuse is always longer than any legs).

We now show that when

∥ β ∥ > (1 + η) \sqrt{M} / δ

, where

η > 1

is a fixed number, then

O (β, z^{(n)}) = \sum_{i = 1}^{n} w (r_{i}^{2} / c^{*}) r_{i}^{2} (β) > M \geq O (0_{p \times 1}, z^{(n)}) .

(A1)

That is, for the solution of the minimization of (5). One only needs to search over the ball

∥ β ∥ \leq (1 + η) \sqrt{M} / δ

, a compact set. Note that

O (β, z^{(n)})

is continuous in

β

since

r_{i} (β)

and

w (r_{i}^{2} / c^{*})

are. Then, the minimization problem certainly has a solution over the compact set.

The proof is complete if we can show (A1) when

∥ β ∥ > (1 + η) \sqrt{M} / δ

. It is not difficult to see that there is at least one

i_{0}

such that

r_{i_{0}}^{2} / c^{*} \leq c

and

w_{i_{0}} \notin N (l_{β}, δ)

since otherwise, it contradicts the definition of

δ

above. Note that

θ_{β}

is the angle between the normal vectors

{(- β^{⊤}, 1)}^{⊤}

and

{(0^{⊤}, 1)}^{⊤}

of hyperplanes

H_{β}

and

H_{h}

, respectively. Then,

| tan θ_{β} | = ∥ β ∥

(see [8]) and (see Figure A1)

| w_{i_{0}}^{⊤} β | > δ | tan θ_{β} | = δ ∥ β ∥ > (1 + η) \sqrt{M} .

Now, we have

\begin{matrix} | r_{i_{0}} (β) | = | w_{i_{0}}^{⊤} β - y_{i_{0}} | & \geq | | w_{i_{0}}^{⊤} β | - | y_{i_{0}} | | > (1 + η) \sqrt{M} - | y_{i_{0}} | . \end{matrix}

(A2)

Therefore,

\begin{matrix} O (β, z^{(n)}) = \sum_{j = 1}^{n} w (r_{j}^{2} / c^{*}) r_{j}^{2} (β) & \geq w (r_{i_{0}}^{2} / c^{*}) r_{i_{0}}^{2} (β) = r_{i_{0}}^{2} (β) \\ > {((1 + η) \sqrt{M} - | y_{i_{0}} |)}^{2} \geq {((1 + η) \sqrt{M} - \sqrt{M})}^{2} \\ = η^{2} M > M \geq O (0_{p \times 1}, z^{(n)}) . \end{matrix}

That is, we have certified (A1). □

Figure A1. A two-dimensional vertical cross-section (that goes through points

(w_{i}^{t}, 0)

and

(w_{i}^{t}, w_{i}^{t} β)

) of a figure in

R^{p}

(

w_{i}^{t} = w_{i}^{⊤}

). Hyperplanes

H_{h}

and

H_{β}

intersect at hyperline

l_{β}

(which does not necessarily pass through

(0, 0)

, here just for illustration). The vertical distance from point

(w_{i}^{t}, 0)

to the hyperplane

H_{β}

,

| w_{i}^{t} β |

, is greater than

δ | tan (θ_{β}) |

.

Figure A1. A two-dimensional vertical cross-section (that goes through points

(w_{i}^{t}, 0)

and

(w_{i}^{t}, w_{i}^{t} β)

) of a figure in

R^{p}

(

w_{i}^{t} = w_{i}^{⊤}

). Hyperplanes

H_{h}

and

H_{β}

intersect at hyperline

l_{β}

(which does not necessarily pass through

(0, 0)

, here just for illustration). The vertical distance from point

(w_{i}^{t}, 0)

to the hyperplane

H_{β}

,

| w_{i}^{t} β |

, is greater than

δ | tan (θ_{β}) |

.

Proof of Lemma 1.

Write

w (r^{2} / c^{*}) r^{2} = c^{*} w (r^{2} / c^{*}) r^{2} / c^{*} : = c^{*} w (x^{2}) x^{2}

, where

x = | r | / \sqrt{c^{*}}

and

x^{2} = r^{2} / c^{*} > c

. It suffices to show that

w (x^{2}) x^{2}

is strictly decreasing in x (this intuitively is clear from Figure 2), or equivalently, to show that the derivative of

w (x^{2}) x^{2}

is negative. A straightforward calculus derivation yields

{(w (x^{2}) x^{2})}^{'} = 2 x / (1 - e^{- k}) (e^{- k {(1 - c / x^{2})}^{2}} (1 - 2 k c / x^{2} (1 - c / x^{2})) - e^{- k}) .

Now it suffices to show that

(e^{- k {(1 - c / x^{2})}^{2}} (1 - 2 k c / x^{2} (1 - c / x^{2})) - e^{- k}) < 0 .

Or, equivalently, it suffices to show that

e^{k ({(1 - c / x^{2})}^{2} - 1)} > 1 - 2 k c / x^{2} (1 - c / x^{2}) .

For convenience, write

t : = c / x^{2}

. Then,

t \to 0

as

x^{2} \to \infty

. Now, we want to show that

e^{- t k (2 - t)} > 1 - 2 k t (1 - t) .

(A3)

A straightforward Taylor expansion of

e^{x} = 1 + x + x^{2} / 2! + x^{3} / 3! + \dots

to the left-hand side (LHS) of (A3) yields

\begin{matrix} e^{- t k (2 - t)} & = 1 + (- 2 k t + k t^{2}) + {(- 2 k t + k t^{2})}^{2} / 2 + {(- 2 k t + k t^{2})}^{3} / 3! + {(- 2 k t + k t^{2})}^{4} / 4! + \dots \\ > 1 + (- 2 k t + k t^{2}) + {(- 2 k t + k t^{2})}^{2} / 2 + {(- 2 k t + k t^{2})}^{3} / 3! \\ = 1 - 2 k t (1 - t) - k t^{2} + {(- k t (2 - t))}^{2} / 6 + 2 {(- k t (2 - t))}^{2} / 6 + {(- k t (2 - t))}^{3} / 6 \\ = 1 - 2 k t (1 - t) + k t^{2} (k {(2 - t)}^{2} / 6 - 1) + k^{2} t^{2} {(2 - t)}^{2} (2 - k t (2 - t)) / 6 \\ > 1 - 2 k t (1 - t) \end{matrix}

(A4)

where the first inequality follows from the fact that

\frac{{(- k t (2 - t))}^{2 n}}{(2 n)!} + \frac{{(- k t (2 - t))}^{2 n + 1}}{(2 n + 1)!} = \frac{{(- k t (2 - t))}^{2 n} (2 n + 1 - k t (2 - t))}{(2 n + 1)!} > 0,

for

n \geq 2

and small enough t. And the last inequality in (A4) follows the facts (i)

k {(2 - t)}^{2} / 6 - 1 > 0

(if

t < 2 - \sqrt{6 / k}

) and (ii)

2 - k t (2 - t) > 0

(if

t < 1 - \sqrt{1 - 2 / k}

).

Combining (A4) with (A3), we complete the proof. □

Proof of Theorem 3.

It suffices to treat the case

p > 1

, and furthermore by Theorem 4 on p. 125 of [9], it is sufficient to show that

m = ⌊ (n - p) / 2 ⌋

contaminating points are not enough to break drown

{\hat{β}}_{w l s}

. Assume it is otherwise. This implies that either

(I): $| {\hat{β}}_{w l s}^{n} {({(z_{m}^{(n)})}_{j})}_{1} | \to \infty$ and $∥ {\hat{β}}_{w l s}^{n} {({(z_{m}^{(n)})}_{j})}_{2} ∥$ is finite, or
(II): $∥ {\hat{β}}_{w l s}^{n} {(z_{m}^{(n)})}_{j})_{2} ∥ = | tan (θ_{{\hat{β}}_{w l s}^{n} {(z_{m}^{(n)})}_{j})}) | \to \infty,$

along a sequence of

{(z_{m}^{(n)})}_{j}

as

j \to \infty

, where the subscripts 1 and 2 correspond to the intercept and non-intercept terms, respectively, as in the case

β = {(β_{1}, β_{2}^{⊤})}^{⊤}

in

R^{p}

. We seek a contradiction for both cases. For description simplicity, write

β_{j} : = {\hat{β}}_{w l s}^{n} ({(z_{m}^{(n)})}_{j})

Case (I). For simplicity, write

β_{j} = {(β_{1}, β_{2}^{⊤})}^{⊤}

and

β_{j j} = {(2^{m} β_{1}, β_{2}^{⊤})}^{⊤}

. Then, it is readily seen that

r_{i}^{2} (β_{j}) < r_{i}^{2} (β_{j j})

for large positive m. In light of Lemma 1, one has that

O (β_{j}) > O (β_{j j})

; a contradiction is obtained.

Case (II). This case implies there is a sequence of hyperplanes induced from

{\hat{β}}_{w l s}^{n} ({(z_{m}^{(n)})}_{j})

that tend to the eventual vertical position as

j \to \infty

. Denote by

H_{j}

those hyperplanes. Let

H_{j}

intercept with the horizontal hyperplane

H_{h}

at

ℓ_{j}

, the hyperlines (or the common part of

H_{j}

and

H_{h}

).

For simplicity, write the minimizer

β_{j} = {(β_{1}, β_{2}^{⊤})}^{⊤} : = {\hat{β}}_{w l s}^{n} ({(z_{m}^{(n)})}_{j})

. Introduce a new hyperplane determined by

β_{j j} = {(α β_{1}, κ β_{2}^{⊤})}^{⊤}

(

κ > 1

is a positive integer). This

β_{j j}

amounts to tilting

H_{j}

(corresponding to

β_{j}

) along

ℓ_{j}

to a more vertical position

H_{j j}

(corresponding to

β_{j j}

). Note that it is possible that there are no data points touched during the titling process except those originally on the

H_{j}

since both hyperplanes are almost vertical. It is readily seen that

r_{i}^{2} (β_{j j}) > r_{i}^{2} (β_{j}) \to \infty

except those points

(x_{i}^{⊤}, y_{i}))^{⊤}

that originally lie on the

ℓ_{j}

with a zero residual. By Lemma 1,

O (β_{j}) > O (β_{j j})

, a contradiction is reached. □

References

Huber, P.J. Robust estimation of a location parameter. Ann. Math. Statist. 1964, 35, 73–101. [Google Scholar] [CrossRef]
Rousseeuw, P.J. Least median of squares regression. J. Amer. Statist. Assoc. 1984, 79, 871–880. [Google Scholar] [CrossRef]
Rousseeuw, P.J.; Yohai, V.J. Robust regression by means of S-estimators. In Robust and Nonlinear Time Series Analysis; Lecture Notes in Statist; Springer: New York, NY, USA, 1984; Volume 26, pp. 256–272. [Google Scholar]
Yohai, V.J. High breakdown-point and high efficiency estimates for regression. Ann. Statist. 1987, 15, 642–656. [Google Scholar] [CrossRef]
Yohai, V.J.; Zamar, R.H. High breakdown estimates of regression by means of the minimization of an efficient scale. J. Am. Stat. Assoc. 1988, 83, 406–413. [Google Scholar] [CrossRef]
Rousseeuw, P.J.; Hubert, M. Regression depth (with discussion). J. Am. Stat. Assoc. 1999, 94, 388–433. [Google Scholar] [CrossRef]
Zuo, Y. On general notions of depth for regression. Stat. Sci. 2021, 36, 142–157. [Google Scholar] [CrossRef]
Zuo, Y.; Zuo, H. Least sum of squares of trimmed residuals regression. Electron. J. Stat. 2023, 17, 2416–2446. [Google Scholar] [CrossRef]
Rousseeuw, P.J.; Leroy, A. Robust Regression and Outlier Detection; Wiley: New York, NY, USA, 1987. [Google Scholar]
Maronna, R.A.; Martin, R.D.; Yohai, V.J. Robust Statistics: Theory and Methods; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
Müller, C. Redescending M-estimators in regression analysis, cluster analysis and image analysis. Discuss. Math. Stat. 2004, 24, 59–75. [Google Scholar] [CrossRef]
Zuo, Y. Projection-based depth functions and associated medians. Ann. Stat. 2003, 31, 1460–1490. [Google Scholar] [CrossRef]
Donoho, D.L.; Huber, P.J. The notion of breakdown point. In A Festschrift foe Erich L. Lehmann; Bickel, P.J., Doksum, K.A., Hodges, J.L., Jr., Eds.; Wadsworth: Belmont, CA, USA, 1983; pp. 157–184. [Google Scholar]
Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
Edgar, T.F.; Himmelblau, D.M.; Lasdon, L.S. Optimization of Chemical Processes, 2nd ed.; McGraw-Hill Chemical Engineering Series; McGraw-Hill: New York, NY, USA, 2001. [Google Scholar]
Gilbert, J.C.; Nocedal, J. Global Convergence Properties of Conjugate Gradient Methods for Optimization. Siam J. Optim. 1992, 2, 21–42. [Google Scholar] [CrossRef]
Harrison, D.; Rubinfeld, D.L. Hedonic prices and the demand for clean air. J. Environ. Econ. Manag. 1987, 5, 81–102. [Google Scholar] [CrossRef]
Buxton, L.H.D. The Anthropology of Cyprus. J. R. Inst. Great Br. Irel. 1920, 50, 183–235. [Google Scholar] [CrossRef]
Hawkins, D.M.; Olive, D.J. Inconsistency of Resampling Algorithms for High Breakdown Regression Estimators and a New Algorithm, (with discussion). J. Am. Stat. Assoc. 2002, 97, 136–159. [Google Scholar] [CrossRef]
Olive, D.J. Robust Multivariate Analysis; Springer: New York, NY, USA, 2017. [Google Scholar]
Park, Y.; Kim, D.; Kim, S. Robust Regression Using Data Partitioning and M-Estimation. Commun. Stat. Simul. Comput. 2012, 8, 1282–1300. [Google Scholar] [CrossRef]
Olive, D.J.; Hawkins, D.M. Practical High Breakdown Regression. 2011. Available online: http://www.math.siu.edu/olive/pphbreg.pdf (accessed on 19 March 2024).

Figure 1. Weight function

w (x)

when

k = 5

and

c = 100

. Left:

w (x)

, right:

w^{'} (x)

.

Figure 1. Weight function

w (x)

when

k = 5

and

c = 100

. Left:

w (x)

, right:

w^{'} (x)

.

Figure 2. Behavior of function

w (x^{2} / c^{*}) x^{2}

when

k = 5

and

c = 100

,

x > c

.

Figure 2. Behavior of function

w (x^{2} / c^{*}) x^{2}

when

k = 5

and

c = 100

,

x > c

.

Figure 3. Behavior of function

γ_{i} (r_{i}) / c^{*}

when

k = 5

and

r_{i} > c

with different values of c.

Figure 3. Behavior of function

γ_{i} (r_{i}) / c^{*}

when

k = 5

and

r_{i} > c

with different values of c.

Figure 4. Left panel: plot of seven artificial points and two reference lines

y = 0

and

y = x

. Right panel: the same seven points are fitted by LTS, WLS, MM, and the LS (benchmark). A solid black line is LTS given by ltsReg. Green dashed line is given by WLS. Red dotted line is given by the LS, which is identical to LTS line and is almost identical to the blue dot-dashed line given by MM in this case.

Figure 4. Left panel: plot of seven artificial points and two reference lines

y = 0

and

y = x

. Right panel: the same seven points are fitted by LTS, WLS, MM, and the LS (benchmark). A solid black line is LTS given by ltsReg. Green dashed line is given by WLS. Red dotted line is given by the LS, which is identical to LTS line and is almost identical to the blue dot-dashed line given by MM in this case.

Figure 5. We show 80 highly correlated normal points with 30% of them contaminated by other normal points. Left: scatterplot of the uncontaminated perfect normal data set and four almost identical lines. Righ: LTS, WLS, MM, and LS lines for the contaminated data set. The solid black is the LTS line, the dotted green is the WLS, the dot-dash blue is given by MM, and dashed red is given by the LS—parallel to LTS line in this case. The MM line is almost identical to LTS and LS lines.

Figure 6. Performance of four procedures with respective to 1000 normal samples (points are highly correlated) with

p = 10

and

n = 100

, each sample suffers 10% contamination.

Figure 6. Performance of four procedures with respective to 1000 normal samples (points are highly correlated) with

p = 10

and

n = 100

, each sample suffers 10% contamination.

Figure 7. Pairwise plots of fitted values versus observed values and fitted values versus values for six different methods.

Figure 8. Fitted values versus standardized residuals plot for six different methods.

Figure 9. Restricted to (head length, height)-space, the two-dimensional vertical cross-section of hyperplances of seven different methods.

Figure 10. Regression lines based on (x = head length, y = height) by seven different methods.

Table 1. EMSE, TT (s), and RE for MM, LTS, WLS, and LS based on all 1000 samples for various ns, ps, and contamination levels.

Normal Data Sets, Each with $ε$ Contamination Rate
p	n	Method	EMSE	TT	RE	EMSE	TT	RE
			$ε = 0 %$			$ε = 10 %$
5	50	mm	0.3356	9.9427	0.9767	0.3357	9.8483	2.9876
		wls	0.3309	7.3604	0.9905	0.3324	9.4740	3.0178
		lts	0.3975	15.883	0.8246	0.3670	15.957	2.7326
		ls	0.3278	1.4243	1.0000	1.0030	1.2834	1.0000
			$ε = 20 %$			$ε = 30 %$
5	50	mm	0.3565	9.8519	5.3673	8.4738	10.532	0.3311
		wls	0.3546	12.329	5.3951	0.3711	15.846	7.5618
		lts	0.6546	16.662	2.9228	27.223	17.026	0.1030
		ls	1.9132	1.3549	1.0000	2.8060	1.3472	1.0000
			$ε = 0 %$			$ε = 10 %$
10	100	mm	0.2378	21.421	0.8839	0.2372	20.892	5.5816
		wls	0.2105	11.112	0.9985	0.2226	15.680	5.9499
		lts	0.2919	48.648	0.7201	0.2584	49.615	5.1245
		ls	0.2102	1.3298	1.0000	1.3242	1.2542	1.0000
			$ε = 20 %$			$ε = 30 %$
10	100	mm	0.2410	20.669	10.244	5.1124	21.891	0.6979
		wls	0.2372	20.535	10.407	0.2600	29.146	13.724
		lts	0.2635	55.018	9.3714	40.403	64.803	0.0883
		ls	2.4691	1.2462	1.0000	3.5680	1.2626	1.0000
			$ε = 0 %$			$ε = 10 %$
20	200	mm	0.2429	84.709	0.6564	0.2183	83.525	6.6713
		wls	0.1592	28.664	1.0021	0.1726	39.100	8.4390
		lts	0.2208	259.21	0.7224	0.2015	293.40	7.2261
		ls	0.1595	1.4936	1.0000	1.4564	1.4775	1.0000
			$ε = 20 %$			$ε = 30 %$
20	200	mm	0.5299	78.387	5.1922	20.908	90.385	0.1899
		wls	0.1875	51.280	14.677	0.2126	71.148	18.672
		lts	0.1983	387.56	13.877	33.918	832.75	0.1170
		ls	2.7512	1.4566	1.0000	3.9694	1.4300	1.0000

Table 2. EMSE, TT (seconds), and RE for MM, LTS, WLS, and LS based on Boston housing real data set.

Performance Measure	MM	WLS	LTS	LS
EMSE	4.352446 × $10^{- 5}$	0.0000	4.619404 × $10^{1}$	0.0000
TT	120.368098	161.465350	125.707603	1.487204
RE	0	NaN	0	NaN

Table 3. Outputs of different methods based on Buxton data set.

Methods	Intercept	Head	Nasal	Bigonal	Cephalic
hbreg	1546.3737947	−1.1288988	6.1133570	−0.5871985	1.1263726
rmreg2	807.3303643	1.7963508	4.8262483	−0.1481552	3.9353752
wls	1437.3761729	−1.1107210	5.2669763	0.9199388	0.9766958
lts	1066.188018	−1.104774	6.476802	2.523815	2.623706
lms	449.515	−1.061	7.317	6.215	4.790
mm	1511.5503972	−1.1289155	6.5942674	−0.6341536	1.2965989
ls	1546.3737947	−1.1288988	6.1133570	−0.5871985	1.1263726

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zuo, Y.; Zuo, H. Weighted Least Squares Regression with the Best Robustness and High Computability. Axioms 2024, 13, 295. https://doi.org/10.3390/axioms13050295

AMA Style

Zuo Y, Zuo H. Weighted Least Squares Regression with the Best Robustness and High Computability. Axioms. 2024; 13(5):295. https://doi.org/10.3390/axioms13050295

Chicago/Turabian Style

Zuo, Yijun, and Hanwen Zuo. 2024. "Weighted Least Squares Regression with the Best Robustness and High Computability" Axioms 13, no. 5: 295. https://doi.org/10.3390/axioms13050295

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Weighted Least Squares Regression with the Best Robustness and High Computability

Abstract

1. Introduction

2. A Class of Weighted Least Squares

2.1. A Class of Weight Functions

2.2. Weighted Least Squares Estimators

3. Properties of the ${\hat{β}}_{w l s}$

3.1. Existence

3.2. Equivariance

3.3. Robustness

4. Computation of the WLS

5. Examples and Comparison

5.1. Performance Criteria

5.2. Examples

6. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Proofs of Main Results

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Weighted Least Squares Regression with the Best Robustness and High Computability

Abstract

1. Introduction

2. A Class of Weighted Least Squares

2.1. A Class of Weight Functions

2.2. Weighted Least Squares Estimators

3. Properties of the β ^ w l s

3.1. Existence

3.2. Equivariance

3.3. Robustness

4. Computation of the WLS

5. Examples and Comparison

5.1. Performance Criteria

5.2. Examples

6. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A. Proofs of Main Results

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3. Properties of the ${\hat{β}}_{w l s}$