This is a follow-up to an earlier post about CovarianceMatrix.

There are several ways in which Statistics:-CorrelationMatrix can be improved.

CorrelationMatrix shares some inefficiencies with CovarianceMatrix, by computing correlations between the n columns, pairwise. But in doing so it also computes the variance of both columns' data for each pairing of columns. That's unnecessary repetition. For each i from 1 to n it computes the variance of column i from scratch, n separate times instead of just once. (Yes, it takes advantage of the symmetry. But it still does (n^2)/2 intsead of n variance calculations.)

Here is some code below which illustrates how it may be done more efficiently. It could be made faster still using only C instead of many Maple Library routine calls. The point is to illustrate the improvement possible in both speed and memory management.

Apart from that complexity issue (about the repeated variance calculations), I'd mention another thing. It can be very beneficial to avoid any command inside a loop which creates (allocates) a new temporary Matrix/Vector/Array. I believe that such is avoided below.


> kernelopts(printbytes=false):

> Corr := proc(M)
> local i,m,n,c,tempc,tempV,res,vones;
> Digits := max(Digits,trunc(evalhf(Digits)));
> m,n := op(1,M);
> if rtable_options(M,'datatype') <> 'float'
> or rtable_options(M,'storage') <> 'rectangular'
> or rtable_options(M,'order') <> 'Fortran_order'
> or rtable_indfns(M) <> [] then
> # M does not have rtable options required by ArrayTools,
> # and copy of M can act as container for intermediate results.
> c := Matrix(m,n,M,'datatype'='float','storage'='rectangular',
> 'order'='Fortran_order');
> res := c;
> else
> # M does have rtable options required by ArrayTools,
> # but new container for intermediate results is needed.
> c := M;
> res := Matrix(m,n,'datatype'='float');
> end if;
> tempc := Vector(m,'datatype'='float');
> tempV := Vector(m,'datatype'='float');
> vones := Vector(m,'datatype'='float');
> ArrayTools:-Fill(m,1.0,vones,0,1);
> for i from 1 to n do
> ArrayTools:-Copy(m,c,(i-1)*m,1,tempc,0,1);
> ArrayTools:-Fill(m,Statistics:-Mean(tempc),tempV,0,1);
> LinearAlgebra:-LA_Main:-VectorAdd(tempV,tempc,-1,1,
> 'inplace'=true,'outputoptions'=[]);
> LinearAlgebra:-LA_Main:-VectorScalarMultiply(tempV,
> 1/sqrt(LinearAlgebra:-LA_Main:-DotProduct(tempc,tempc,'conjugate'=false)
> - (LinearAlgebra:-LA_Main:-DotProduct(tempc,vones,
> 'conjugate'=false)^2)/m),
> 'inplace'=true,'outputoptions'=[]);
> ArrayTools:-Copy(m,tempV,0,1,res,(i-1)*m,1);
> end do;
> LinearAlgebra:-Transpose(res).res;
> end proc:


> M := LinearAlgebra:-RandomMatrix(600,300):
> origM := copy(M):

> st,ba:=time(),kernelopts(bytesalloc):
> ans1 := Corr(M):
> time()-st,kernelopts(bytesalloc)-ba;
0.271, 11663272

> st,ba:=time(),kernelopts(bytesalloc):
> ans2 := Statistics:-CorrelationMatrix(M):
> time()-st,kernelopts(bytesalloc)-ba; # 450MB !!
10.974, 448839400

> LinearAlgebra:-Norm(ans1-ans2), LinearAlgebra:-Norm(evalf(M)-evalf(origM));
-8
0.252635772539496969 10 , 0.


> gc():

> M := LinearAlgebra:-RandomMatrix(1000,1000,outputoptions=[datatype=float]):
> origM := copy(M):

> st:=time(): ans1 := Corr(M): time()-st;
1.737

> st:=time(): ans2 := Statistics:-CorrelationMatrix(M): time()-st;
118.441

> LinearAlgebra:-Norm(ans1-ans2), LinearAlgebra:-Norm(M-origM);
-8
0.458945009096666011 10 , 0.


Someone might way to check my computation of any individual column's variance, in there. (To ensure that I didn't use an n instead of an (n-1) for sample variance, or anything else like that.)

acer


Please Wait...