1 Introduction
Second-order methods, specifically Natural Gradient Descent (NGD) [2, 1, 6, 17], have gained traction in recent years as they accelerate the training of deep neural networks (DNNs) by capturing the geometry of the optimization landscape [21] with the Fisher Information Matrix (FIM) [23, 17, 16]. NGD's running time depends on both the convergence rate and the computation per iteration, which typically involves computing and inverting the FIM [13]. Exact NGD methods [30] demonstrate improved convergence compared to first-order techniques such as Stochastic Gradient Descent (SGD) [25], and approximate NGD variants attempt to improve the time per iteration, and hence the overall time, with approximations [13, 11, 12]. However, to our knowledge, none of the current approximate NGD approaches outperform or are comparable with the end-to-end wall-clock time of tuned SGD when training DNNs [13, 11, 12].
Exact NGD methods such as [30] and [3] improve the convergence of training DNNs compared to first-order methods; however, they are typically expensive and do not scale due to their high computational complexity per iteration. The work in [3] derives an analytic formula for linear networks using the generalized inverse of the curvature, which cannot be used for training state-of-the-art models due to the lack of non-linear activation functions. Zhang et al. [30] extend NGD to deep non-linear networks with non-smooth activations and show that NGD converges to the global optimum at a linear rate. However, their method fails to scale to large or even moderate-size models, primarily because it relies heavily on backpropagating Jacobian matrices, which scales with the network's output dimension [7]. In [24], the authors use the Woodbury identity for the inversion of the Fisher matrix and propose a unified framework for subsampled Gauss-Newton and NGD methods. Their framework is limited to fully connected networks, relies on the empirical Fisher, and requires extra forward-backward passes to perform parameter updates, which slows down training [19, 7]. Approximate NGD approaches such as [13, 11, 12, 5, 14, 8] attempt to improve the overall execution time of NGD by approximating the FIM inverse; however, due to costly operations in the approximation process, they do not noticeably reduce the overall NGD time. In these methods, the FIM is approximated with a block-diagonal matrix, where blocks correspond to layers. The dimension of each block scales with the input and output dimensions of the layer, and therefore it cannot be computed and inverted efficiently for wide layers. To alleviate this cost, some methods further approximate each block inverse to reduce its size and computational complexity. For example, KFAC [13] approximates each block inverse using the Kronecker product of two smaller matrices, i.e. Kronecker factors. However, these factors are large for wide layers and hence their inversion is expensive. EKFAC [11]
improves the approximation used in KFAC by rescaling the Kronecker factors with a diagonal matrix obtained via costly singular value decompositions. Other work such as KBFGS [12] further estimates the inverses of the Kronecker factors using low-rank BFGS-type updates. WoodFisher [27] estimates the empirical FIM block inverses using rank-one updates; however, this estimation does not contain enough useful curvature information to produce a good search direction [20]. Our proposed method, TENGraD, improves the time efficiency of NGD by reducing the FIM block-inversion cost using a computationally efficient covariance factorization and reuse of intermediate values.

Recent work such as [30, 3] shows that exact NGD converges to the global optimum at a linear rate. Work such as [24] analyzes convergence for the Levenberg-Marquardt variant of NGD without specifying the convergence rate. The work in [30] provides a convergence analysis of KFAC for a shallow two-layer fully connected network, where the convergence rate depends on the condition number of the input data. Goldfarb et al. [12] follow the framework of stochastic quasi-Newton methods and prove that KBFGS converges at a sublinear rate for networks with bounded activation functions. TENGraD improves on the convergence properties of approximate NGD methods and has a linear convergence rate for DNNs with non-smooth activation functions.
Motivation and contributions: Figures 1a and 1b show the performance of approximate NGD methods (including TENGraD) and SGD on an image classification benchmark with a DNN. As shown in Figure 1a, the NGD methods provide more accurate updates, resulting in a better loss reduction compared to SGD. However, as shown in Figure 1b, despite the efforts made by approximate NGD methods to improve the overall time, none is competitive with tuned SGD with momentum. TENGraD is the first NGD-based implementation, which we call a time-efficient implementation, that outperforms other approximate NGD methods as well as SGD (under certain conditions) in overall time while benefiting from the convergence properties of second-order methods.
In the following, we demonstrate, for the fully connected layer shown in Figure 1c, why TENGraD is more time-efficient than state-of-the-art approximate NGD methods. Figure 2 compares KFAC, EKFAC, KBFGS, and TENGraD. All methods first use a block-approximate approach to create the blocked Fisher matrix. KFAC, EKFAC, and KBFGS approximate each block with the Kronecker product of two Kronecker factors. Because the size of the factors is proportional to the input and output dimensions of a layer, their inversion is costly to compute for wide layers and is cubic in the input and output dimensions. EKFAC creates its factors with a complexity that is also cubic in the input and output dimensions of a layer. It also computes a scaling factor by performing matrix multiplications with the SVD factors. The rank-2 BFGS updates in KBFGS lead to a computation complexity that is quadratic in the input and output size of a layer.
TENGraD computes the exact inverse of the Fisher blocks using the Woodbury matrix identity, so the inverse is factorized into two covariance factors obtained with a novel method called covariance factorization. The size of each factor equals the mini-batch size, which is typically in the order of hundreds, hence considerably smaller and easier to invert than the Kronecker factors in other NGD methods. The other matrices involved in the Fisher-block inverse in TENGraD, i.e. the inputs and pre-activation derivatives, are reused. This accelerates the inverse computation as it removes extra passes over the network by storing intermediate values computed during the forward-backward passes. As a result, TENGraD becomes a truly time-efficient approximate NGD method and improves the overall time. Also, to the best of our knowledge, we are the first to prove the linear-rate convergence of approximate NGD methods for deep neural networks with ReLU activation functions. Our contributions are:

- A time-efficient NGD method called TENGraD that computes the exact inverse of Fisher blocks efficiently by using a covariance factorization and reuse technique.

- An extension of TENGraD to convolutional layers using tensor algebra.

- An empirical demonstration that TENGraD outperforms the overall time of state-of-the-art NGD methods and SGD on well-known deep learning benchmarks.

- A proof that TENGraD converges to the global optimum with a linear rate when exact gradients are used, with an iteration complexity comparable to exact NGD methods.
2 Background and Definitions
A deep neural network transforms its input to an output through a series of layers, where $\theta$ is the vector of network parameters. During the forward pass, each layer performs a linear transformation on its input of $n$ samples, parameterized with its weight matrix, and computes its outputs, or pre-activations, followed by a non-linear activation function such as ReLU. The optimal parameters are obtained by minimizing the average loss over the training set during the backward pass:

$\min_{\theta}\; h(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i;\theta),\, y_i\big) \qquad (1)$

where $\theta$ is the vector containing all of the network's parameters concatenated together, $\mathrm{vec}(\cdot)$ is the operator which vectorizes a matrix by stacking its columns, and $\ell$ measures the accuracy of the prediction $f(x_i;\theta)$ against the label $y_i$ (e.g. the cross-entropy loss).
Gradient and Jacobian in DNN. Optimization methods that minimize the average loss in Equation 1 rely on gradient computations. The average gradient over a mini-batch of $n$ samples is $\nabla h = \frac{1}{n}\sum_{i=1}^{n}\nabla \ell_i$, obtained using standard backpropagation. During backpropagation, each layer receives the gradient w.r.t. its output, i.e. the pre-activation derivatives, and computes the gradient w.r.t. its parameters. For a single-output network, the Jacobian of the loss w.r.t. the parameters is the matrix $J$ whose $i$-th row is the gradient of the loss on sample $i$. In the stochastic setting, for a layer, the Jacobian of the loss w.r.t. the layer parameters is approximated over the mini-batch.
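The relationship between per-sample gradients, the Jacobian $J$, and the mini-batch gradient can be checked on a toy single-output linear model. This is a NumPy sketch; the squared-error loss and all sizes are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d_in = 4, 3                       # mini-batch size, input dimension
X = rng.standard_normal((n, d_in))   # rows are samples
y = rng.standard_normal(n)
w = rng.standard_normal(d_in)        # single-output linear "network"

# Per-sample squared-error losses l_i = 0.5 * (x_i^T w - y_i)^2.
res = X @ w - y
J = X * res[:, None]                 # row i = gradient of l_i w.r.t. w

# The mini-batch gradient is the average of the rows of J, i.e. J^T 1 / n.
avg_grad = J.T @ np.ones(n) / n
assert np.allclose(avg_grad, (X.T @ res) / n)
```

Each row of `J` is one per-sample gradient; averaging the rows reproduces the standard backpropagated mini-batch gradient.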
NGD update rule. Natural gradient descent scales the gradient with the inverse of the FIM as follows:

$\theta_{t+1} \;=\; \theta_t - \eta\,\big(F(\theta_t) + \lambda I\big)^{-1}\nabla h(\theta_t) \qquad (2)$

where $\eta$ is the learning rate, $\lambda$ is the damping parameter, and $F$ is the FIM of the model's conditional distribution, defined as $F = \mathbb{E}\big[\nabla_{\theta}\log p(y\,|\,x;\theta)\,\nabla_{\theta}\log p(y\,|\,x;\theta)^{\top}\big]$, where $p(y\,|\,x;\theta)$ is the model's density function. Since $F$ is singular for over-parameterized models, a non-negative damping term $\lambda I$ is added to make it invertible. As shown in [21], the FIM coincides with the Gauss-Newton matrix if the conditional distribution is in the exponential family, which implies:

$F(\theta) \;=\; J^{\top} H J \qquad (3)$

where $J$ is the Jacobian of the network outputs w.r.t. the parameters and $H$ is the Hessian of the loss w.r.t. the network outputs. Hereafter, we write $F$ for $F(\theta)$ to simplify the notation.
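The damped natural-gradient step of Equation 2 can be sketched in a few lines of NumPy. Everything below (the sizes, the constants, and building $F$ from per-sample gradient rows) is illustrative rather than the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: n = 5 parameters; a stand-in Fisher matrix is built from
# m = 8 per-sample gradient rows of J as F = J^T J (illustrative scaling).
m, n = 8, 5
J = rng.standard_normal((m, n))
grad = J.mean(axis=0)              # stand-in for the average gradient
F = J.T @ J                        # stand-in Fisher matrix

eta, lam = 0.1, 1e-3               # learning rate and damping (illustrative)
theta = rng.standard_normal(n)

# Damped NGD step: theta <- theta - eta * (F + lam*I)^{-1} grad.
# Solving the linear system is preferred over forming the inverse.
step = np.linalg.solve(F + lam * np.eye(n), grad)
theta_new = theta - eta * step
```

Using `np.linalg.solve` instead of an explicit inverse mirrors how preconditioned updates are applied in practice.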
Notations. The following notations are used in the paper: $\mathrm{diag}(A_1,\ldots,A_L)$ is the block-diagonal matrix with matrices $A_1,\ldots,A_L$ on its diagonal; $X$ is the input data; $\lambda_{\min}(A)$ is the smallest eigenvalue of matrix $A$; $\odot$ is the Hadamard product between two matrices; $\otimes$ is the Kronecker product; $\|\cdot\|$ is the Euclidean norm of vectors/matrices; $\star$ is the column-wise Khatri-Rao product; and $A_{:,i}$ denotes the $i$-th column of matrix $A$.

3 Time-Efficient NGD with Exact Fisher-block Inversion
The computation bottleneck in NGD is inverting the FIM and applying it to the gradient. Naively computing the Fisher inverse is impractical due to its large size, which is quadratic in the number of parameters, and the number of parameters can be in the millions. Therefore, the Fisher matrix is typically approximated with a block-diagonal matrix where each block corresponds to a layer [13]. As a result, the approximate FIM becomes a block-diagonal matrix. TENGraD computes the exact inverse of each block (hereafter, all computations are performed on one block, thus the layer index is removed from the notation). TENGraD uses the Woodbury identity for each Fisher block:

$\big(J^{\top}J + \lambda I\big)^{-1}\nabla h \;=\; \frac{1}{\lambda}\Big(\nabla h - J^{\top}\big(JJ^{\top} + \lambda I\big)^{-1}J\,\nabla h\Big) \qquad (4)$

where $\nabla h$ and $J$ are the gradient and Jacobian associated with the Fisher block $F = J^{\top}J$ (we drop the layer index to simplify the notation). The main steps of the new update rule are: (A) computing and inverting the Gram Jacobian $JJ^{\top}$, which is costly because the Jacobian and a matrix-matrix multiplication have to be computed; (B) scaling the gradient with the Jacobian, $J\nabla h$, to transform the gradient into the input space; and (C) rescaling with $J^{\top}$ to transform back to the parameter space. The Gram matrix computation in (A) has high complexity, proportional to both the input and output dimensions of a layer. TENGraD mitigates this cost with a covariance factorization that factors the Gram Jacobian into a low-cost operation between two small matrices. Steps (B) and (C) also involve Jacobian-vector products that are orders of magnitude more expensive than a gradient computation [7]. TENGraD mitigates this cost by reusing the inputs and activations computed during the forward-backward pass.
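The Woodbury route can be verified numerically. In the NumPy sketch below (sizes, damping, and the $F = J^{\top}J$ convention are illustrative), steps (A)-(C) reproduce the direct inverse while only ever solving a batch-sized system:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 16, 1000                   # batch size m is much smaller than n parameters
J = rng.standard_normal((m, n))   # rows: per-sample gradients for one block
g = rng.standard_normal(n)        # gradient associated with the block
lam = 1e-2

# Direct route: solve an n x n system -- infeasible for real layers.
direct = np.linalg.solve(J.T @ J + lam * np.eye(n), g)

# Woodbury route: only an m x m system involving the Gram Jacobian J J^T.
gram = J @ J.T                               # step (A): Gram Jacobian
u = J @ g                                    # step (B): gradient into input space
v = np.linalg.solve(gram + lam * np.eye(m), u)
woodbury = (g - J.T @ v) / lam               # step (C): back to parameter space

assert np.allclose(direct, woodbury)
```

The dominant cost drops from cubic in the parameter count to cubic in the batch size, which is the key to the method's time efficiency.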
3.1 Covariance Factorization and Reuse of Intermediate Values in TENGraD
TENGraD factors the Gram matrix $JJ^{\top}$ into two smaller $n \times n$ matrices, called covariance factors, where $n$ is the batch size. These factors can be obtained with low overhead because the batch size is typically small. For a single sample and the fully connected layer in Figure 1c, with input $x$ and pre-activation derivatives $z$, we show the low-rank property of the gradient and how it is leveraged to reduce the computation complexity of the Gram Jacobian. The gradient of the loss w.r.t. the parameters for sample $i$ is computed from the input and pre-activation derivatives as the outer product $g_i = z_i x_i^{\top}$, and is therefore a rank-one matrix. The $(i,j)$ element of the Gram Jacobian is the inner product of the vectorized gradients of samples $i$ and $j$, which simplifies to $\mathrm{vec}(g_i)^{\top}\mathrm{vec}(g_j) = (x_i^{\top}x_j)(z_i^{\top}z_j)$. As a result, each element of the Gram Jacobian is efficiently computed without forming the gradients. Extended to a mini-batch of $n$ samples with input matrix $X = [x_1,\ldots,x_n]$ and pre-activation derivative matrix $Z = [z_1,\ldots,z_n]$, the per-sample gradients are the columns of the column-wise Khatri-Rao product $X \star Z$. Therefore the compact form of the Gram Jacobian is:
$J = (X \star Z)^{\top}, \qquad J J^{\top} \;=\; (X^{\top} X) \odot (Z^{\top} Z) \qquad (5)$
From Equation 5, the Gram Jacobian is written as the Hadamard product of two smaller matrices, i.e. the covariance factors. The input covariance $X^{\top}X$ and the pre-activation covariance $Z^{\top}Z$ are efficiently computed during the forward and backward passes of the neural network using a single matrix-matrix multiplication each. The Jacobian does not need to be explicitly formed to compute these factors, reducing the overall computation complexity. Also, these factors are small and have low storage costs compared to the Kronecker factors, whose sizes scale with the squared input and output dimensions of the layer.
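The factorization can be checked directly: form the per-sample gradients explicitly, build the Gram Jacobian, and compare it with the Hadamard product of the two covariance factors (NumPy sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d_in, d_out = 4, 6, 3               # batch size, layer input/output dims
X = rng.standard_normal((d_in, m))     # inputs, one column per sample
Z = rng.standard_normal((d_out, m))    # pre-activation derivatives, per sample

# Explicit route: per-sample gradient g_i = z_i x_i^T is rank one; its
# vectorization is the Kronecker product x_i (x) z_i. Stack them as columns.
J = np.column_stack([np.kron(X[:, i], Z[:, i]) for i in range(m)])

gram_explicit = J.T @ J                         # m x m Gram Jacobian
gram_factored = (X.T @ X) * (Z.T @ Z)           # Hadamard of covariance factors

assert np.allclose(gram_explicit, gram_factored)
```

The factored route never materializes the $d_{in}d_{out} \times n$ Jacobian, only two $n \times n$ covariance matrices.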
The cost of storing the Jacobian for the NGD update in steps (B) and (C) of Equation 4 is considerable because its storage cost is $n$ times the number of parameters, where $n$ is the batch size. With a reuse method, TENGraD instead recomputes the Jacobian when needed, with a negligible computation cost, by storing small data structures. TENGraD breaks down the computation of the Jacobian-vector products in steps (B) and (C) into two operations involving the input and pre-activation derivative matrices $X$ and $Z$. In step (B) of Equation 4, the vectorized form of the gradient for the layer parameters, $\mathrm{vec}(V)$, is propagated. The objective is to compute $J\,\mathrm{vec}(V)$ without explicitly forming the Jacobian. With $J = (X \star Z)^{\top}$ and using the properties of the column-wise Khatri-Rao product:
$J\,\mathrm{vec}(V) \;=\; \big((V X) \odot Z\big)^{\top}\mathbf{1} \qquad (6)$
where $\mathbf{1}$ is a vector of all ones with appropriate dimension. The Jacobian-vector product is therefore computed with two efficient matrix operations without forming the Jacobian: first $V X$ is computed, followed by a Hadamard product with the pre-activation derivative matrix $Z$ and a column-wise summation. In step (C) of Equation 4, the objective is to compute $J^{\top} v$ with the vector $v$ obtained from step (A). Using the column-wise Khatri-Rao structure of $J$:
$J^{\top} v \;=\; \mathrm{vec}\big(Z\,\mathrm{diag}(v)\,X^{\top}\big) \qquad (7)$
Similarly, $J^{\top} v$ is computed using the two-step process in Equation 7: the columns of $Z$ are scaled by the entries of $v$, and the result is multiplied by $X^{\top}$. By storing $X$ and $Z$, TENGraD can efficiently compute all of the operations required in the NGD update.
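Both reuse identities can be verified against an explicitly formed Jacobian. In the NumPy sketch below, row-major flattening plays the role of $\mathrm{vec}$ (a bookkeeping choice that does not affect the identities), and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d_in, d_out = 4, 6, 3
X = rng.standard_normal((d_in, m))     # stored layer inputs
Z = rng.standard_normal((d_out, m))    # stored pre-activation derivatives

# Explicit Jacobian: row i is the flattened per-sample gradient z_i x_i^T.
J = np.stack([np.outer(Z[:, i], X[:, i]).ravel() for i in range(m)])

V = rng.standard_normal((d_out, d_in))         # layer gradient in matrix form
u = rng.standard_normal(m)                     # vector in batch space

# Step (B): J vec(V) via (V X), a Hadamard with Z, then column-wise summation.
Jv = ((V @ X) * Z).sum(axis=0)
assert np.allclose(Jv, J @ V.ravel())

# Step (C): J^T u via scaling the columns of Z by u and one matrix product.
JTu = (Z * u) @ X.T                            # result in matrix form
assert np.allclose(JTu.ravel(), J.T @ u)
```

Only `X` and `Z` are stored between passes; the Jacobian itself is never materialized.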
Storage and Computation Complexity. Table 1 shows the computation and storage complexity of TENGraD vs. other NGD methods. Curvature shows the cost of computing and inverting the Kronecker factors for KFAC, EKFAC, and KBFGS, and the Gram Jacobian for TENGraD. For a layer with input dimension $d_{in}$ and output dimension $d_{out}$, computing the covariance factors costs $O(n^2(d_{in} + d_{out}))$, while the Kronecker factors in KFAC cost $O(n(d_{in}^2 + d_{out}^2))$. Inverting the factors in TENGraD is $O(n^3)$, while that of KFAC is $O(d_{in}^3 + d_{out}^3)$. Thus, when the batch size $n$ is smaller than the layer dimensions, TENGraD's computation complexity is better than the others. TENGraD reduces storage complexity with reuse from $O(n\,d_{in}d_{out})$ to $O(n(d_{in} + d_{out}))$; see the Extra Storage column. Extra Pass refers to the cost of a second backpropagation, and Step is the computation cost of the parameter updates after computing the curvature.
Algorithm  Curvature  Extra Pass  Step  Extra Storage 

TENGraD  
KFAC  
EKFAC  
KBFGS 
3.2 Extension to Convolution Layers
We extend TENGraD to support convolutional layers; the general form is provided in the Appendix. Convolutional layers operate on high-dimensional tensors. For a general tensor of dimension $d$, the components can be represented with indices $i_1,\ldots,i_d$, where index $i_j$ is associated with dimension $j$. We consider a convolutional layer with an input tensor and convolutional filters, where the spatial support of each kernel filter is $k \times k$, there are $C_{in}$ input channels, and $C_{out}$ feature maps. Let $\mathcal{Y}$ be the output of the convolutional layer, whose spatial dimensions are determined by the input size and the filter support.
To extend the covariance factorization to a convolutional layer, we first derive a closed-form equation for the Jacobian and show that the convolution operation can be written as a matrix multiplication for a single sample. From Figure 3, the input can be reshaped to a matrix $\widetilde{X}$ with dimensions $T \times C_{in}k^2$, where $T$ is the number of spatial positions and each row is an unfolded sub-tensor of the input that has the same size as a group of filters. Each filter is also unfolded into a 1D vector of size $C_{in}k^2$. The weight tensor is reshaped to a matrix $\widetilde{W}$ with size $C_{out} \times C_{in}k^2$ (see Figure 3). Hence, the convolution can be written as $\widetilde{Y} = \widetilde{X}\widetilde{W}^{\top}$, where the shape of $\widetilde{Y}$ is $T \times C_{out}$. The matrix $\widetilde{Y}$ is reshaped to the output tensor by folding each column into a feature map. Applying a similar unfolding to the pre-activation derivatives tensor results in the matrix $\widetilde{Z}$ with a shape of $T \times C_{out}$.
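This unfolding is the standard im2col construction. The NumPy sketch below (stride 1, no padding, illustrative sizes) checks that the unfolded matrix product reproduces a direct correlation for one output element:

```python
import numpy as np

rng = np.random.default_rng(4)
C_in, C_out, H, W, k = 2, 3, 5, 5, 3        # channels, spatial size, kernel
x = rng.standard_normal((C_in, H, W))
w = rng.standard_normal((C_out, C_in, k, k))

# Unfold the input: each row is one C_in x k x k patch ("im2col").
Ho, Wo = H - k + 1, W - k + 1
patches = np.stack([
    x[:, i:i + k, j:j + k].ravel()
    for i in range(Ho) for j in range(Wo)
])                                          # shape (T, C_in*k*k), T = Ho*Wo

# Reshape the filters to a matrix and express convolution as one matmul.
w_mat = w.reshape(C_out, -1)                # shape (C_out, C_in*k*k)
out = patches @ w_mat.T                     # shape (T, C_out)

# Cross-check one output element against a direct correlation.
direct = np.sum(x[:, 0:k, 0:k] * w[0])
assert np.allclose(out[0, 0], direct)
```

Deep-learning "convolution" is implemented as correlation, which is what the direct check computes.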
The compact form of the Jacobian for a batch of $n$ samples, with $\widetilde{X}_i$ and $\widetilde{Z}_i$ the unfolded input and pre-activation derivatives for sample $i$, is:

$J \;=\; \big[\mathrm{vec}(\widetilde{Z}_1^{\top}\widetilde{X}_1),\,\ldots,\,\mathrm{vec}(\widetilde{Z}_n^{\top}\widetilde{X}_n)\big]^{\top} \qquad (8)$

We derive a closed form for an arbitrary element of the Gram Jacobian matrix and then extend it to the whole batch. Using Equation 8, each element of the Gram Jacobian is written as:

$\big[JJ^{\top}\big]_{ij} \;=\; \mathbf{1}^{\top}\Big(\big(\widetilde{X}_i\widetilde{X}_j^{\top}\big) \odot \big(\widetilde{Z}_i\widetilde{Z}_j^{\top}\big)\Big)\mathbf{1} \qquad (9)$
Hence, each element of the Gram Jacobian is obtained via a reduction over the covariance matrices $\widetilde{X}_i\widetilde{X}_j^{\top}$ and $\widetilde{Z}_i\widetilde{Z}_j^{\top}$, which are defined over the spatial support $T$.
With tensor notations, we expand the dimensions of the covariance factors. Since each element $(i,j)$ of the Gram Jacobian is represented with two covariance matrices, the Gram Jacobian involves $n^2$ pairs of covariance matrices. The covariance tensors $\mathcal{O}$ and $\mathcal{G}$ are defined as:

$\mathcal{O}_{ij} \;=\; \widetilde{X}_i\widetilde{X}_j^{\top}, \qquad \mathcal{G}_{ij} \;=\; \widetilde{Z}_i\widetilde{Z}_j^{\top} \qquad (10)$

where $\mathcal{O}_{ij}$ denotes the $T \times T$ sub-matrix at index $(i,j)$. As a result, the Gram Jacobian is written as:

$\big[JJ^{\top}\big]_{ij} \;=\; \mathbf{1}^{\top}\big(\mathcal{O}_{ij} \odot \mathcal{G}_{ij}\big)\mathbf{1} \qquad (11)$
Corollary 1
For a standard convolutional layer with spatial support , overall filter support and batch size , TENGraD reduces the computational complexity of the Gram Jacobian computation from to .
For TENGraD's reuse in step (B), the gradient tensor of the convolution layer is reshaped to matrix form to compute the Jacobian-vector product. Using Equation 9, this can be written as a two-step process without explicitly forming the Jacobian matrix. Similarly, for step (C), the result is computed from the stored unfolded matrices. For better storage, instead of the unfolded input $\widetilde{X}$ we store the input tensor itself.
4 Linear Convergence in TENGraD
We provide convergence guarantees for TENGraD with exact gradients (i.e. the full-batch case). We focus on the single-output case, but the general case with multiple outputs is similar. As shown below, TENGraD converges linearly at a rate independent of the condition number of the input data matrix, unlike the convergence rate provided for KFAC in [30]. Consequently, for sufficiently ill-conditioned data, our convergence guarantees improve upon those currently available for KFAC.
Let us introduce the output vector $u(\theta)$ whose $i$-th entry is the network output on sample $x_i$. We consider the squared-error loss on a given dataset, i.e. the objective is to minimize $h(\theta) = \frac{1}{2}\|u(\theta) - y\|^2$. The update rule (4) of TENGraD with exact gradients becomes

$\theta_{t+1} \;=\; \theta_t - \eta\,\big(\bar F(\theta_t) + \lambda I\big)^{-1} J(\theta_t)^{\top}\big(u(\theta_t) - y\big) \qquad (12)$
where $\bar F$ is the block-diagonal Fisher matrix and the block Jacobian is the Jacobian of the outputs w.r.t. the parameters of each layer. It can be seen directly from the TENGraD update rule (4) that if the Gram matrix stays positive definite and bounded for each layer and iteration, then by the standard theory of preconditioned gradient descent methods [4] it converges globally for sufficiently small step sizes. In the following, we introduce two assumptions: the first ensures that the Gram matrix is positive definite at initialization, and the second is a stability condition on the Jacobian, requiring that it does not vary too rapidly. These assumptions allow us to control the convergence rate.
Assumption 1
The data is normalized and, for any $i \neq j$, the samples $x_i$ and $x_j$ are independent. Lastly, the Gram matrices of the individual layers are positive definite at initialization.
Lee et al. [9] show that if the input data are i.i.d. samples, then Assumption 1 can often be satisfied.
Assumption 2
There exists that satisfies if .
Notice that Assumption 2 requires that the network behave like a linearized network for small values of the constant [9]. As a result of Assumptions 1-2, the Fisher-block matrix remains close to its initialization over the iterations, and therefore the Gram matrices of the layers stay positive definite during training, paralleling the setting of [30, 10, 22]. In practice, we can also keep the Gram matrices well-conditioned during training by tuning the damping parameter.
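The role of these assumptions, a positive definite and slowly varying Gram matrix, can be checked on a toy problem where they hold exactly: linear least squares, for which the damped NGD step reduces to a damped Gauss-Newton step. All sizes and constants below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 20, 5
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# Least-squares loss h(theta) = 0.5/N * ||X theta - y||^2; its Gauss-Newton/
# Fisher matrix X^T X / N is constant, so the Gram matrix trivially stays
# positive definite over the iterations.
F = X.T @ X / N
lam, eta = 0.5, 1.0
theta = np.zeros(p)

opt = np.linalg.lstsq(X, y, rcond=None)[0]
opt_loss = 0.5 * np.mean((X @ opt - y) ** 2)

gaps = []
for _ in range(10):
    grad = X.T @ (X @ theta - y) / N
    theta -= eta * np.linalg.solve(F + lam * np.eye(p), grad)
    gaps.append(0.5 * np.mean((X @ theta - y) ** 2) - opt_loss)

gaps = np.array(gaps)
# Linear (geometric) convergence: the optimality gap shrinks every iteration.
assert np.all(gaps[1:] < gaps[:-1])
```

On this quadratic, each eigen-direction of the error contracts by the constant factor $\lambda/(\mu_i+\lambda)$ per step, which is the preconditioned-gradient-descent behavior the theorem establishes for the nonlinear case.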
Next, we present our convergence result in Theorem 1. Our analysis follows the techniques of [30] and adapts them to TENGraD. Notice that TENGraD differs from NGD and KFAC in the matrix it uses to approximate the FIM; therefore, the results of [30] do not directly apply to our setting. As a proof technique, we first utilize Assumptions 1 and 2 to derive a lower bound on the smallest eigenvalue of the Fisher-block matrix, showing that it is positive, and building on this we derive the convergence rate. The proof can be found in the Appendix.
Theorem 1
Theorem 1 states that TENGraD converges to the global optimum with a linear rate under Assumptions 1 and 2. Therefore, TENGraD requires only $O(\log(1/\epsilon))$ iterations to achieve $\epsilon$ accuracy. Moreover, our result shows that a smaller learning rate is needed for a deeper network to guarantee linear convergence. This behavior is different from that of exact NGD, because the rate of exact NGD provided in [30] does not depend on the network size. We also observe from Theorem 1 that a smaller damping parameter improves the upper bound on the learning rate, which suggests faster convergence for TENGraD. This makes the analysis of the smallest eigenvalue of the Gram matrix important for the performance of TENGraD, since the Gram matrix can potentially be ill-conditioned. The work [30] uses matrix concentration inequalities and harmonic analysis techniques to bound the minimum eigenvalue of the Gram matrix of the L-layer ReLU network, which does not provide a tight bound for the Gram matrix of each layer. The authors of [22] obtain tighter bounds in the NTK setting, and their result can be used to derive a lower bound on the smallest eigenvalue. Due to limited space, we present the details of the explicit bound in the Appendix.
We can show that, under some mild conditions on the distribution of the input data, if the weights of each layer are drawn from a Gaussian distribution and suitable conditions on the layer widths are satisfied, then the smallest eigenvalue of the Gram matrix of each layer is bounded away from zero with some probability, where the bound depends on the variances of the initialization (see Appendix). This implies that we can choose a smaller damping parameter to get a faster rate according to Theorem 1, which, in turn, improves the performance of TENGraD. Otherwise, $\lambda$ needs to be larger to guarantee the inversion in (12).

5 Experimental Results
We test TENGraD, KFAC, EKFAC, KBFGS, and SGD with momentum on three image classification benchmarks, namely Fashion-MNIST [28], CIFAR-10 [18], and CIFAR-100 [18]. For Fashion-MNIST, a network with three convolution layers and one fully connected layer is used, hence called 3C1F. For the other benchmarks, the following DNN architectures are used: DenseNet [15], WideResNet [29], and MobileNetV2 [26].
In all experiments, 50K samples are used for training and 10K for testing. Experiments are conducted on AWS P3.2xlarge machines with one Volta V100 GPU and 16 GB of RAM. Each experiment is repeated 5 times and the average is reported. In our theoretical analysis we assumed deterministic gradients, but in the experiments we use stochastic gradients estimated with mini-batches. The Appendix provides additional setup information.
As the training loss plots in Figure 4 show, TENGraD clearly outperforms all other approximate NGD methods in both timing and training accuracy. Moreover, TENGraD performs better than SGD in both time and accuracy on the Fashion-MNIST dataset, as shown in (a), and it has competitive timing compared to SGD on both CIFAR-10 and CIFAR-100, as shown in (b) and (c). The test accuracy reported in Figure 5 shows that TENGraD also generalizes well and achieves state-of-the-art accuracy on all benchmarks; in particular, it outperforms KBFGS with a margin of 1.6%. Moreover, TENGraD clearly outperforms SGD in both time and test accuracy on Fashion-MNIST, as shown in (a), with competitive results on CIFAR-10 and CIFAR-100.
To further demonstrate the performance of TENGraD, we conduct extensive experiments on state-of-the-art DNNs and report the time and test accuracy for the 7 configurations in Table 2. Our results show that TENGraD outperforms KFAC, EKFAC, and KBFGS in overall time in 7 out of 7 (7/7) experiments. Moreover, TENGraD achieves better accuracy, with a margin between 0.4% and 1.41%, in 6/7 experiments compared to KBFGS, which is the fastest of the approximate Kronecker-factored NGD methods. Compared to SGD, TENGraD achieves better timing in 3/7 experiments and better test accuracy in 5/7. In particular, TENGraD shows a 20% improvement over SGD in time on the Fashion-MNIST dataset and better timing for CIFAR-10 on WideResNet and DenseNet.
Models  Dataset  Time(s) [Accuracy%]
  SGD  TENGraD  KFAC  EKFAC  KBFGS
3C1F  FMNIST  52 [92.60]  43 [93.15]  188 [92.2]  275 [92.01]  113 [92.12]
WideResNet
DenseNet
MobileNetV2
To compare the quality of the Fisher matrix approximation, we compare the block-diagonal approximations in TENGraD and KFAC with exact NGD. As Figure 6 shows, TENGraD provides an approximation similar to exact NGD by preserving the structure of the Fisher matrix, whereas KFAC's approximation hardly captures the structure. Similar results are observed for other layers (see Appendix).

To demonstrate robustness, we examine the training accuracy under various hyperparameter (HP) settings and show that TENGraD is stable across a wide range of HP values (see Appendix).
6 Conclusion
We propose a Time-Efficient NGD method, called TENGraD, that improves the overall time of second-order methods with computationally efficient factorization and reuse methods, and we prove linear convergence rates under certain assumptions. An extension to convolutional and general layers is provided. Results indicate that TENGraD outperforms or performs comparably to state-of-the-art NGD methods as well as SGD. Our work does not have a direct negative impact as it is mostly theoretical.
Acknowledgements
This work was supported in part by NSERC Discovery Grants (RGPIN-06516, DGECR-00303), the Canada Research Chairs program, and NSF grants CCF-1814888, NSF DMS-2053485, NSF DMS-1723085, and Office of Naval Research Award Number N00014-21-1-2244.
References
 [1] (1997) Neural learning in structured parameter spaces  natural riemannian gradient. In Advances in Neural Information Processing Systems, M. C. Mozer, M. Jordan, and T. Petsche (Eds.), Vol. 9, pp. . External Links: Link Cited by: §1.
 [2] (1998) Natural gradient works efficiently in learning. Neural computation 10 (2), pp. 251–276. Cited by: §1.
 [3] (2019) Exact natural gradient in deep linear networks and application to the nonlinear case. Cited by: §1, §1.
 [4] (1997) Nonlinear programming. Journal of the Operational Research Society 48 (3), pp. 334–334. Cited by: §4.

 [5] (2017) Practical Gauss-Newton optimisation for deep learning. In International Conference on Machine Learning, pp. 557–565. Cited by: §1.
 [6] (2019) A Gram-Gauss-Newton method learning overparameterized deep neural networks for regression problems. Machine learning (1/28). Cited by: §1.
 [7] (2019) BackPACK: packing more into backprop. In International Conference on Learning Representations, Cited by: §1, §3.
 [8] (2015) Natural neural networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. . External Links: Link Cited by: §1.
 [9] (201909–15 Jun) Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1675–1685. External Links: Link Cited by: §4, §4.
 [10] (2018) Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054. Cited by: §4.
 [11] (2018) Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §1, §1.
 [12] (2020) Practical quasi-Newton methods for training deep neural networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 2386–2396. External Links: Link Cited by: §1, §1, §1.
 [13] (2016) A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pp. 573–582. Cited by: §1, §1, §3.

 [14] (2000) On "natural" learning and pruning in multilayered perceptrons. Neural Computation 12 (4), pp. 881–901. Cited by: §1.
 [15] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §A.4.1, §5.
 [16] (2019) Universal statistics of Fisher information in deep neural networks: mean field approach. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1032–1041. Cited by: §1.
 [17] (2020) Understanding approximate Fisher information for fast convergence of natural gradient descent in wide neural networks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 10891–10901. External Links: Link Cited by: §1.
 [18] (2009) Learning multiple layers of features from tiny images. Technical report . Cited by: §5.
 [19] (2019) Limitations of the empirical fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §1.
 [20] (2015) Optimizing neural networks with kroneckerfactored approximate curvature. In International conference on machine learning, pp. 2408–2417. Cited by: §1.
 [21] (2020) New insights and perspectives on the natural gradient method. Journal of Machine Learning Research 21 (146), pp. 1–76. Cited by: §1, §2.
 [22] (2020) Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks. arXiv preprint arXiv:2012.11654. Cited by: §A.3.2, §A.3.2, §A.3.2, §4, §4.
 [23] (2018) The spectrum of the fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §1.
 [24] (2019) Efficient subsampled gaussnewton and natural gradient methods for training neural networks. arXiv preprint arXiv:1906.02353. Cited by: §1, §1.
 [25] (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1.
 [26] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §A.4.1, §5.
 [27] (2020) WoodFisher: efficient secondorder approximation for neural network compression. Advances in Neural Information Processing Systems 33. Cited by: §1.
 [28] (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.
 [29] (2016) Wide residual networks. In British Machine Vision Conference 2016, Cited by: §A.4.1, §5.
 [30] (2019) Fast convergence of natural gradient descent for overparameterized neural networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §1, §1, §1, §4, §4, §4, §4.
Appendix A Appendix
a.1 Outline
The general form of the covariance factorization discussed in Section 3.2 is provided in Section A.2. The convergence proof of Theorem 1 discussed in Section 4 is provided in Section A.3.1. The bounds for the minimum eigenvalue discussed in Section 4 are provided in Section A.3.2. We provide the setup information in Section A.4.1. The plots that compare the quality of the approximation of the Fisher inverse blocks are provided in Section A.4.2. We provide the sensitivity-to-hyperparameters figures in Section A.4.3. For the error bars, refer to Section A.5.

a.2 General Form of Covariance Factorization
In this section, we extend the results obtained in Section 3 and show that the covariance factorization and reuse method can be applied to an arbitrary layer. An arbitrary layer can operate on its input tensor in many ways depending on its underlying linear transformations. Typically these transformations operate differently on each tensor dimension. For example, a transformation such as convolution applies a specific filtering on spatial dimensions. Therefore, we need to distinguish between these dimensions depending on the operations performed on them. In the following, we categorize these dimensions into free, independent, and common dimensions.
Tensor Dimensions. A common dimension is a dimension of the unfolded input that is shared between the input and output tensors, e.g., the spatial dimensions. An independent dimension is a dimension of the unfolded input that is shared between the input and the parameters, e.g., the input-channel and filter dimensions. A free dimension is a dimension of the parameters that is shared between the output and the parameters, e.g., the output channel.
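These three dimension categories can be made concrete with a small convolution example. The following sketch (all shapes and names are illustrative, not from the paper) unfolds the input of a convolutional layer with an explicit im2col loop; the output spatial positions form the common dimension, the input-channels-times-kernel axis forms the independent dimension, and the output channels form the free dimension carried by the parameters:

```python
import numpy as np

# Illustrative shapes: batch n=2, in-channels c=3, spatial 5x5, 3x3 kernel,
# out-channels f=4, stride 1, no padding -> 3x3 output spatial grid.
n, c, h, w = 2, 3, 5, 5
f, k = 4, 3
t = (h - k + 1) * (w - k + 1)   # common dimension: output spatial positions (9)
i = c * k * k                   # independent dimension: in-channels x kernel (27)

x = np.random.randn(n, c, h, w)

# Unfold (im2col): gather every k x k patch at every output position.
cols = np.empty((n, t, i))
idx = 0
for a in range(h - k + 1):
    for b in range(w - k + 1):
        cols[:, idx, :] = x[:, :, a:a + k, b:b + k].reshape(n, -1)
        idx += 1

weight = np.random.randn(f, i)  # free dimension f lives on the parameters
# Convolution becomes a plain linear map over the independent dimension.
out = np.einsum('nti,fi->nft', cols, weight)
print(out.shape)  # (2, 4, 9): batch, free (out-channels), common (spatial)
```

For a fully connected layer the common dimension degenerates to size one and the unfolded input is just the input itself.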
Next, we show that the Jacobian matrix can be formed as a linear transformation between two tensors, namely the input and the preactivation derivatives, which is later used to compute the Gram Jacobian.
Lemma 1
Consider a layer with parameters that applies a linear transformation to its input tensor, where and are the indices of the free dimension and the batch samples, respectively. Suppose that the output for a sample can be computed without requiring other samples, i.e., there is no dependency among samples within a batch. The preactivation derivatives are given by the tensor . We refer to the set as the common dimensions and to the set as the independent dimensions. Then the Jacobian tensor can be written as:
(13) 
where the tensor is obtained by unfolding the input tensor:
(14) 
where depends on the underlying linear transformation.
Proof of Lemma 1. Assume an arbitrary linear transformation of the following form:
(15) 
where and define the underlying linear transformation, e.g., the input and filter spatial dimensions in a convolutional layer. We can now rewrite Equation 15 by unfolding the input tensor. To do so, for each index we unfold the input tensor as follows:
(16) 
The above equation unfolds the input along all the dimensions shared with the output tensor, i.e. , and along all the dimensions in the parameter tensor. As a result, the output can be written as:
(17) 
Using this result, and the fact that the preactivation-derivative tensor has the same dimensions as the output tensor , we obtain the Jacobian tensor as stated in Equation 13.
Lemma 1 shows that the Jacobian tensor can be obtained through a linear transformation between the unfolded input tensor and the preactivation-derivative tensor . Using this result, we can now extend the covariance factorization in TENGraD to an arbitrary layer.
Theorem 2
For a linear layer with input tensor and preactivation derivatives defined in Lemma 1, the Gram matrix is obtained by:
where and are the covariance tensors of the unfolded input and the preactivation derivatives, defined as:
Proof of Theorem 2. Using Lemma 1, we can write the Gram Jacobian matrix by expanding the common dimensions as follows:
As a result, the Gram Jacobian for any layer can be obtained by a reduction over the Hadamard product of two small covariance tensors. In the following corollary, we state the conditions under which Theorem 2 improves performance in practice by reducing the computational complexity.
Corollary 2
Suppose we denote the sizes of the common, independent, and free dimensions by , , and , respectively. Then Theorem 2 reduces the complexity of the Gram matrix computation from to .
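The factorization in Theorem 2 can be checked numerically. The sketch below (all symbols and shapes are illustrative placeholders) forms the Jacobian explicitly from an unfolded input U and preactivation derivatives G as in Lemma 1, then recovers the same Gram matrix from the two small covariance tensors and a reduction over their Hadamard product:

```python
import numpy as np

# Lemma 1 (illustrative indexing): J[n, f, i] = sum_t G[n, f, t] * U[n, t, i]
# Theorem 2: Gram[n, m] = sum_{t, s} Omega[n, m, t, s] * Gamma[n, m, t, s]
rng = np.random.default_rng(0)
n, f, t, i = 4, 3, 5, 6              # batch, free, common, independent sizes
U = rng.standard_normal((n, t, i))   # unfolded input
G = rng.standard_normal((n, f, t))   # preactivation derivatives

# Direct route: materialize the Jacobian, then Gram = J J^T.
J = np.einsum('nft,nti->nfi', G, U).reshape(n, -1)
gram_direct = J @ J.T

# Factorized route: two small covariance tensors, Hadamard product, reduce.
Omega = np.einsum('nti,msi->nmts', U, U)   # unfolded-input covariance
Gamma = np.einsum('nft,mfs->nmts', G, G)   # preactivation-derivative covariance
gram_fact = (Omega * Gamma).sum(axis=(2, 3))

print(np.allclose(gram_direct, gram_fact))  # True
```

The factorized route never materializes the Jacobian, whose free-times-independent parameter dimension is what makes the direct computation expensive for wide layers.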
Remark 1
Theorem 2 applies to a wide range of layers in deep neural networks, such as fully connected, convolution, and group convolution layers. It also applies to layers with a nonlinear transformation, as long as the nonlinearity acts only on the input data, as in layer normalization.
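For the fully connected case the common dimension has size one, so the reduction in Theorem 2 collapses to an elementwise product of two N x N matrices. A minimal sketch of this special case (variable names are illustrative):

```python
import numpy as np

# Fully connected layer: per-sample weight gradient is the outer product
# g_n x_n^T, so Gram[n, m] = (x_n . x_m) * (g_n . g_m).
rng = np.random.default_rng(1)
n, d_in, d_out = 5, 7, 3
X = rng.standard_normal((n, d_in))    # layer inputs
G = rng.standard_normal((n, d_out))   # preactivation derivatives

# Direct: materialize per-sample gradients, Gram = J J^T.
J = np.einsum('nf,ni->nfi', G, X).reshape(n, -1)
gram_direct = J @ J.T

# Factorized: Hadamard product of two N x N covariance matrices.
gram_fact = (X @ X.T) * (G @ G.T)
print(np.allclose(gram_direct, gram_fact))  # True
```

This is the degenerate instance of Theorem 2 where both covariance tensors have singleton common dimensions.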
A.3 Proofs of Section 4
A.3.1 Convergence Analysis of TENGraD
In this section, we provide the proof of Theorem 1. If Assumption 2 holds, we have . This result helps bound the minimum eigenvalue of the Gram Jacobian at each iteration, as stated in the following lemma.
Lemma 2
If , then we have .
Proof of Lemma 2. By the inequality and the fact that , we have:
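The inline math of this lemma was lost in extraction; the following is a hedged reconstruction of the standard one-line perturbation argument, assuming the hypothesis is $\lVert G - G^{(0)} \rVert_2 \le \lambda_0/2$ with $\lambda_0 = \lambda_{\min}(G^{(0)})$ (these symbols are assumptions, not from the original text):

```latex
% Weyl's inequality for symmetric matrices, then the assumed hypothesis:
\lambda_{\min}(G)
  \;\ge\; \lambda_{\min}(G^{(0)}) - \lVert G - G^{(0)} \rVert_2
  \;\ge\; \lambda_0 - \tfrac{\lambda_0}{2}
  \;=\; \tfrac{\lambda_0}{2}.
```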
We can now prove the linear convergence rate in Theorem 1.
Proof of Theorem 1. We prove Theorem 1 by induction. Assume . We can see that with and , where is the identity matrix of dimension .