How to deal with zeros in data?

Discussions of ARCH, GARCH, and related models
Aixia_Mei
Posts: 29
Joined: Wed Dec 03, 2014 7:16 pm

How to deal with zeros in data?

Unread post by Aixia_Mei »

Hi Tom,

I've posted before, but I still have questions about how to deal with the zeros in my data set.

My research aims to examine the influence of the introduction of short selling on feedback trading and volatility in the Chinese stock market. As can be seen in the attached Excel document, there are many zeros in each stock's log-return series. Zeros that appear in the same rows for all sample stocks correspond to China's short-term national holidays, while runs of zeros in an individual stock's column reflect a trading suspension caused by the company's own operating problems. Both types of zeros therefore represent non-trading days.

As you suggested, I've tried deleting the first type of zeros (the runs of 4 or 5 zeros across all stocks), but the results still look odd: there is always a convergence failure. Could you please take a look?

The input code is as follows:

OPEN DATA "C:\Users\aixia\Desktop\51 Return_xls.xls"
CALENDAR(D) 2006:04:03
DATA(FORMAT=XLS,ORG=COLUMNS,RIGHT=2) 2006:04:03 2014:03:31 PUDONG
*
set r1 = PUDONG
*
nonlin b0 b1 b2 b3 a0 a1 a2 a3 d
compute d=1.0

stat(NOPRINT) r1
compute start = 2
compute end = 1943

**
set v = %variance
set u = 0.0

frml et = r1-b0-b1*v-(b2+b3*v)*r1{1}
frml ht = a0+a1*u{1}**2+a2*v{1}+%if(u{1}<0.0, a3*u{1}**2, 0.0)

***Using GED density
frml Lt = (v(t)=ht(t)), (u(t)=et(t)), $
log(.5)+log(d)+.5*%lngamma(3/d)-1.5*%lngamma(1/d)-.5*log(v)- $
exp((d/2.0)*(%lngamma(3/d)-%lngamma(1/d)))*(abs(u/sqrt(v)))**d

linreg(noprint) r1; # constant r1{1}
compute b0=%beta(1), b1=0.0, b2=%beta(2), b3=0.0
compute a0=%seesq, a1=.09781, a2=.83756, a3=0.0
nlpar(subiter=250)

maximize(method=simplex,recursive,iterations=6) Lt 2 *
maximize(method=bfgs,robust,recursive,iter=500) Lt 2 *


The output is:

MAXIMIZE - Estimation by Simplex
Daily(5) Data From 2006:04:04 To 2014:03:31
Usable Observations 19
Skipped/Missing (from 2085) 2066
Function Value 172.1642

Variable Coeff
**********************************************
1. B0 0.000057147
2. B1 -0.301938042
3. B2 0.014385197
4. B3 -0.055167436
5. A0 0.000193743
6. A1 0.077369570
7. A2 -0.064979929
8. A3 -0.070342733
9. D 0.151616428


MAXIMIZE - Estimation by BFGS
NO CONVERGENCE IN 18 ITERATIONS
LAST CRITERION WAS 0.0000010
ESTIMATION POSSIBLY HAS STALLED OR MACHINE ROUNDOFF IS MAKING FURTHER PROGRESS DIFFICULT
TRY HIGHER SUBITERATIONS LIMIT, TIGHTER CVCRIT, DIFFERENT SETTING FOR EXACTLINE OR ALPHA ON NLPAR
RESTARTING ESTIMATION FROM LAST ESTIMATES OR DIFFERENT INITIAL GUESSES MIGHT ALSO WORK
With Heteroscedasticity/Misspecification Adjusted Standard Errors
Daily(5) Data From 2006:04:04 To 2014:03:31
Usable Observations 19
Skipped/Missing (from 2085) 2066
Function Value 268.2201

Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. B0 0.000057 0.000001 40.36660 0.00000000
2. B1 -0.304113 0.009320 -32.62939 0.00000000
3. B2 0.014385 0.000000 0.00000 0.00000000
4. B3 -0.055167 0.000000 0.00000 0.00000000
5. A0 0.000195 0.000016 11.87531 0.00000000
6. A1 -5514.806599 0.170650 -32316.40693 0.00000000
7. A2 -0.043771 0.025546 -1.71343 0.08663372
8. A3 58066.352495 31.382881 1850.25561 0.00000000
9. D 0.108551 0.038620 2.81076 0.00494253


Many thanks,
Aixia
Attachments
51 return_xls.xls
(6.77 MiB)
TomDoan
Posts: 7814
Joined: Wed Nov 01, 2006 4:36 pm

Re: How to deal with zeros in data?

Unread post by TomDoan »

The strings of zeros aren't directly the problem. The problem is that you're using the GED, which (as I pointed out in the earlier post) is a bad choice. Use the t instead if you want to allow for fat tails. You're also using very old coding: both the GED and the t have built-in functions, MAXIMIZE has the PMETHOD and PITERS options so you don't need two separate MAXIMIZE instructions, and the RECURSIVE option has no effect.

As for what you should do with the zeros: if you have a string of observations where the markets are closed, you probably want those out of the data set. You might also want to put in a dummy marking the first entry after the markets have been closed and add a variance shift to the GARCH model to allow for higher variance on those observations.
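(To make the dummy idea concrete: a rough Python sketch of flagging the first entry after a closure, assuming, as in this thread, that a zero return marks a non-trading day. The function name and the zero-return convention are illustrative only; this is not RATS code.)

```python
def post_closure_dummy(returns):
    """Return a 0/1 series that is 1 on the first trading day after a
    run of non-trading days, treating a zero return as a closed market
    (the convention in this thread's data)."""
    dummy = []
    prev_closed = False
    for r in returns:
        closed = (r == 0.0)
        # 1 only on the first non-zero return following a run of zeros
        dummy.append(1 if (not closed and prev_closed) else 0)
        prev_closed = closed
    return dummy

# e.g. post_closure_dummy([0.01, 0.0, 0.0, 0.02, -0.01]) gives [0, 0, 0, 1, 0]
```

That dummy can then enter the variance equation as a shift regressor for the post-closure observations.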
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

Many thanks for your reply. I've rewritten the two places you pointed out. The updated code is:

OPEN DATA "C:\Users\aixia\Desktop\51 Return_xls.xls"
CALENDAR(D) 2006:04:03
DATA(FORMAT=XLS,ORG=COLUMNS,RIGHT=2) 2006:04:03 2014:03:31 PUDONG
*
set r1 = PUDONG
*
nonlin b0 b1 b2 b3 a0 a1 a2 a3

stat(NOPRINT) r1
compute start = 2
compute end = 2087

*
set v = %variance
set u = 0.0

***GJR-GARCH(1,1)
frml et = r1-b0-b1*v-(b2+b3*v)*r1{1}
frml ht = a0+a1*u{1}**2+a2*v{1}+%if(u{1}<0.0, a3*u{1}**2, 0.0)

***with t distributed errors
garch(asymmetric,p=1,q=1,distrib=t,regressors,hseries=ht) / r1
# constant v r1{1} v*r1{1}

**
linreg(noprint) r1; # constant r1{1}
compute b0=%beta(1), b1=0.0, b2=%beta(2), b3=0.0
compute a0=%seesq, a1=.09781, a2=.83756, a3=0.0
nlpar(subiter=250)

maximize(pmethod=simplex,piters=15,method=bfgs,iters=500,robust) ht gstart gend *


Running the above code produces an error message that I cannot understand:

## SX22. Expected Type SERIES[REAL], Got FRML[REAL] Instead
>>>>ressors,hseries=ht)<<<<

Regards,
Aixia
TomDoan

Re: How to deal with zeros in data?

Unread post by TomDoan »

Because you're trying to use HT in two different ways:

FRML:

frml ht = a0+a1*u{1}**2+a2*v{1}+%if(u{1}<0.0, a3*u{1}**2, 0.0)

***with t distributed errors

SERIES:

garch(asymmetric,p=1,q=1,distrib=t,regressors,hseries=ht) / r1
# constant v r1{1} v*r1{1}

Use different names for the different purposes.
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

TomDoan wrote:Because you're trying to use HT in two different ways:

FRML:

frml ht = a0+a1*u{1}**2+a2*v{1}+%if(u{1}<0.0, a3*u{1}**2, 0.0)

***with t distributed errors

SERIES:

garch(asymmetric,p=1,q=1,distrib=t,regressors,hseries=ht) / r1
# constant v r1{1} v*r1{1}

Use different names for the different purposes.
Many thanks for your reply.

I've rewritten the code for the t distribution, but I don't know what to supply for x in the t density function.
I get an error message from RATS as:
## SX11. Identifier X is Not Recognizable. Incorrect Option Field or Parameter Order?
>>>>t)),%TDENSITY(x,nu)<<<<

My code is as follows:

OPEN DATA "C:\Users\aixia\Desktop\51 Return_xls.xls"
CALENDAR(D) 2006:04:03
DATA(FORMAT=XLS,ORG=COLUMNS,RIGHT=2) 2006:04:03 2014:03:31 PUDONG
*
set r1 = PUDONG
*
nonlin b0 b1 b2 b3 a0 a1 a2 a3

stat(NOPRINT) r1
compute start = 2
compute end = 2087

*
set v = %variance
set u = 0.0

***GJR-GARCH(1,1)
frml et = r1-b0-b1*v-(b2+b3*v)*r1{1}
frml ht = a0+a1*u{1}**2+a2*v{1}+%if(u{1}<0.0, a3*u{1}**2, 0.0)

***with t distributed errors
frml Lt = (v(t)=ht(t)), (u(t)=et(t)),%TDENSITY(x,nu)


**
linreg(noprint) r1; # constant r1{1}
compute b0=%beta(1), b1=0.0, b2=%beta(2), b3=0.0
compute a0=%seesq, a1=.09781, a2=.83756, a3=0.0
nlpar(subiter=250)

maximize(pmethod=simplex,piters=15,method=bfgs,iters=500,robust) Lt gstart gend *
TomDoan

Re: How to deal with zeros in data?

Unread post by TomDoan »

You want the %LOGTDENSITY function, not %TDENSITY. The x and nu in the description of the %TDENSITY (note that they're different for %LOGTDENSITY) are the formal arguments. You have to replace them with the appropriate variables or expressions for the way you are writing this. You don't even have a degrees of freedom variable in your parameter set, so you're not getting too far anyway, but your variance will be v(t) and your "u" will be u(t).

Note that if the message says that "X" is undefined, it means that X is undefined: either you made a mistake and didn't mean to write "X", or you forgot to define X.
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

TomDoan wrote:You want the %LOGTDENSITY function, not %TDENSITY. The x and nu in the description of the %TDENSITY (note that they're different for %LOGTDENSITY) are the formal arguments. You have to replace them with the appropriate variables or expressions for the way you are writing this. You don't even have a degrees of freedom variable in your parameter set, so you're not getting too far anyway, but your variance will be v(t) and your "u" will be u(t).

Note that if the message says that "X" is undefined, it means that X is undefined: either you made a mistake and didn't mean to write "X", or you forgot to define X.
Many thanks Tom. But are there any examples in the free manuals or the GARCH volatility course materials that show how to do this? I couldn't find one myself.
TomDoan

Re: How to deal with zeros in data?

Unread post by TomDoan »

There are quite a few that use %LOGDENSITY (for a Normal). For instance, GARCH7_1.RPF in the course:

frml logl = u=sp-meanf,uu=u^2,h=hf,%logdensity(h,u)

For the same model with a t, you add the degrees of freedom to the parameter set (NU is the obvious name), initialize it (NU=10 probably will work fine) and replace the %logdensity(h,u) with %logtdensity(h,u,nu) (with h and u being whatever you are calling your variance and residual).
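For readers who want to see the two densities side by side outside RATS, here is a Python sketch of the Normal and variance-parameterized Student-t log densities. This is my reading of what %LOGDENSITY(h,u) and %LOGTDENSITY(h,u,nu) compute (check the RATS reference for the definitive forms); it assumes nu > 2 so the variance exists.

```python
import math

def log_normal_density(h, u):
    """log N(0, h) density at residual u -- the %LOGDENSITY analogue."""
    return -0.5 * (math.log(2 * math.pi * h) + u * u / h)

def log_t_density(h, u, nu):
    """Log density at residual u of a Student-t rescaled to have
    variance h, with nu > 2 degrees of freedom -- the %LOGTDENSITY
    analogue as I understand it."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(math.pi * (nu - 2) * h)
            - (nu + 1) / 2 * math.log1p(u * u / ((nu - 2) * h)))
```

As nu grows, the t collapses to the Normal; for small nu it puts far more mass in the tails, which is what "allowing for fat tails" buys you.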
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

TomDoan wrote:There are quite a few that use %LOGDENSITY (for a Normal). For instance, GARCH7_1.RPF in the course:

frml logl = u=sp-meanf,uu=u^2,h=hf,%logdensity(h,u)

For the same model with a t, you add the degrees of freedom to the parameter set (NU is the obvious name), initialize it (NU=10 probably will work fine) and replace the %logdensity(h,u) with %logtdensity(h,u,nu) (with h and u being whatever you are calling your variance and residual).
Thanks a lot Tom. Following your instructions, I finally got a result. But both the coefficient and the standard error of B3 look strange: the values are far too large to be correct. Besides, when I run my code on only the first half of the first stock's series, it doesn't even converge. Yet when I tried the second stock's series, which has far fewer zeros, it converges. Can you give me some suggestions? I really don't know how to solve this problem.

My current code is as follows:
OPEN DATA "C:\Users\aixia\Desktop\51 Return_xls.xls"
CALENDAR(D) 2006:04:03
DATA(FORMAT=XLS,ORG=COLUMNS,RIGHT=2) 2006:04:03 2010:03:30 PUDONG
*
set r1 = PUDONG
*
nonlin b0 b1 b2 b3 a0 a1 a2 a3 nu
compute nu=10.0

stat(NOPRINT) r1
compute start = 2
compute end = 1043

*
set v = %variance
set u = 0.0

***GJR-GARCH(1,1)
frml et = r1-b0-b1*v-(b2+b3*v)*r1{1}
frml ht = a0+a1*u{1}**2+a2*v{1}+%if(u{1}<0.0, a3*u{1}**2, 0.0)

***with t distributed errors
frml Lt = (v(t)=ht(t)), (u(t)=et(t)),%LOGTDENSITY(v,u,nu)

**
linreg(noprint) r1; # constant r1{1}
compute b0=%beta(1), b1=0.0, b2=%beta(2), b3=0.0
compute a0=%seesq, a1=.09781, a2=.83756, a3=0.0
nlpar(subiter=250)

maximize(pmethod=simplex,piters=15,method=bfgs,iters=500,robust) Lt 2 *


The results over the two different time ranges:

MAXIMIZE - Estimation by BFGS
Convergence in 28 Iterations. Final criterion was 0.0000061 <= 0.0000100
With Heteroscedasticity/Misspecification Adjusted Standard Errors
Daily(5) Data From 2006:04:04 To 2014:03:31
Usable Observations 2085
Function Value 5012.9736

Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. B0 -0.00098677 0.00050287 -1.96227 0.04973082
2. B1 0.99633406 0.73397511 1.35745 0.17463847
3. B2 -0.02106737 0.02713513 -0.77639 0.43752027
4. B3 -3.45171694 19.76990309 -0.17459 0.86139825
5. A0 0.00000194 0.00000203 0.95537 0.33939265
6. A1 0.07091891 0.03167404 2.23902 0.02515440
7. A2 0.94568887 0.01941585 48.70707 0.00000000
8. A3 0.01062862 0.02529564 0.42018 0.67435686
9. NU 3.04404519 0.43005143 7.07833 0.00000000


MAXIMIZE - Estimation by BFGS
NO CONVERGENCE IN 36 ITERATIONS
LAST CRITERION WAS 0.0000000
ESTIMATION POSSIBLY HAS STALLED OR MACHINE ROUNDOFF IS MAKING FURTHER PROGRESS DIFFICULT
TRY HIGHER SUBITERATIONS LIMIT, TIGHTER CVCRIT, DIFFERENT SETTING FOR EXACTLINE OR ALPHA ON NLPAR
RESTARTING ESTIMATION FROM LAST ESTIMATES OR DIFFERENT INITIAL GUESSES MIGHT ALSO WORK
With Heteroscedasticity/Misspecification Adjusted Standard Errors
Daily(5) Data From 2006:04:04 To 2010:03:30
Usable Observations 1041
Function Value 2195.8753

Variable Coeff Std Error T-Stat Signif
************************************************************************************
1. B0 0.000008 0.000022 0.38762 0.69830055
2. B1 0.351136 0.557324 0.63004 0.52866853
3. B2 0.051197 0.026988 1.89705 0.05782105
4. B3 2.316022 12.489829 0.18543 0.85288970
5. A0 0.000000 0.000000 0.19918 0.84211948
6. A1 0.860045 0.228412 3.76533 0.00016633
7. A2 0.588121 0.087199 6.74456 0.00000000
8. A3 2.076042 0.597498 3.47456 0.00051169
9. NU 2.291973 0.178240 12.85893 0.00000000


Regards,
Aixia
TomDoan

Re: How to deal with zeros in data?

Unread post by TomDoan »

It's your model. Is that interaction term (the B3) your idea? In B3*V*R1{1}, the B3*V needs to be no more than O(1) or the model will go unstable. Thus, if you change the scale of the data, or run over a range with higher volatility, the B3 will have to adapt. Unlike B2, there is no natural scale for it. To be perfectly honest, I would be shocked if B3 came in statistically significant.

I'd be concerned with the fact that you're getting NU's that close to 2. Those are really, really fat tails, which would make me think that there are either some serious outliers or a structural break in the model.
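The scale point can be seen with a toy iteration of just the feedback term, r_t = (B2 + B3*V)*r_{t-1}, stripping out all the stochastic parts of the model (a deliberately crude Python sketch, not something you would estimate):

```python
def feedback_path(b2, b3, v, r0, steps):
    """Iterate r = (b2 + b3*v) * r. The path is stable only while the
    combined feedback coefficient stays below 1 in absolute value, so
    b3 has to rescale whenever the variance level v changes scale."""
    r, path = r0, []
    coef = b2 + b3 * v
    for _ in range(steps):
        r = coef * r
        path.append(r)
    return path
```

With b2 = 0.1, b3 = 0.5 and v = 0.4 the path dies out; keep the same b3 but move to data where v is a thousand times larger and the coefficient exceeds 1, so the path explodes. That is why B3, unlike B2, has no natural scale.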
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

TomDoan wrote:It's your model. Is that interaction term (the B3) your idea? In B3*V*R1{1}, the B3*V needs to be no more than O(1) or the model will go unstable. Thus, if you change the scale of the data, or run over a range with higher volatility, the B3 will have to adapt. Unlike B2, there is no natural scale for it. To be perfectly honest, I would be shocked if B3 came in statistically significant.

I'd be concerned with the fact that you're getting NU's that close to 2. Those are really, really fat tails, which would make me think that there are either some serious outliers or a structural break in the model.
The baseline model I'm using is actually a popular and classic model of feedback trading behavior, the heterogeneous-trader model (Sentana and Wadhwani, 1992). I've run my code (both GJR-GARCH with GED and with t errors) on data from other countries' markets, and all those results look good. So I compared the Chinese data set with the others and noticed that the significant difference is the large number of zeros in China's data. That is why I think the zeros may be the problem.

What I just did was delete all the zeros from my database while keeping the date column unchanged, just to see whether I could get sensible results with a zero-free data set. It seems you are correct: the results still look strange; some estimations fail to converge, and others produce very odd coefficient values. I'll discuss this with my supervisors and may come back to you.

Thanks a lot for your time and suggestions.
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

TomDoan wrote:It's your model. Is that interaction term (the B3) your idea? In B3*V*R1{1}, the B3*V needs to be no more than O(1) or the model will go unstable. Thus, if you change the scale of the data, or run over a range with higher volatility, the B3 will have to adapt. Unlike B2, there is no natural scale for it. To be perfectly honest, I would be shocked if B3 came in statistically significant.

I'd be concerned with the fact that you're getting NU's that close to 2. Those are really, really fat tails, which would make me think that there are either some serious outliers or a structural break in the model.

Hi Tom,

To deal with those strings of zeros in my time series, I've decided to introduce a deterministic, time-varying variable Nt into my GARCH variance equation. Nt represents the length of the non-trading period: the number of non-trading days between the current trading day t and the preceding trading day t-1.

I have some questions about writing the code for Nt. First, the condition Rt<>0 and Rt-1==0 has to hold; then the number of consecutive zeros in the log-return series up to that point needs to be counted. My question is: how do I write the code for Nt? How can I ask RATS to count the length of a string of zeros ending at a specific time point?

Perhaps something like below:
If r{1}==0 and r<>0,
set nt =
else
set nt = 0

Many thanks,
Aixia
TomDoan

Re: How to deal with zeros in data?

Unread post by TomDoan »

See the description of the %IF function in the Introduction to RATS book.
Aixia_Mei

Re: How to deal with zeros in data?

Unread post by Aixia_Mei »

TomDoan wrote:See the description of the %IF function in the Introduction to RATS book.
Hi Tom,

Thanks for that. But with the %if(x,y,z) function, the condition in my case is not simply whether x equals 0 or not, so I don't know whether my idea can be written as:
set nt = %if(r{1}==0 r<>0,y,0)

Also, my original question was really: how do I ask RATS to calculate the total number of non-trading days, i.e. the length of a string of consecutive zeros? In other words, how should the y in the code above be written?

Kind regards,
Aixia
TomDoan

Re: How to deal with zeros in data?

Unread post by TomDoan »

The following will give you a count of consecutive zeroes in the series x:

set(first=(x==0.0)) zerocount = %if(x==0,zerocount{1}+1,0.0)
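For anyone following the thread outside RATS, the same recursion in Python (the helper name is made up; the SET instruction above is the RATS version):

```python
def zero_run_counts(x):
    """Running count of consecutive zeros in x, resetting to zero on
    any non-zero value -- the same recursion as the SET instruction."""
    counts, run = [], 0
    for value in x:
        run = run + 1 if value == 0.0 else 0
        counts.append(run)
    return counts

# e.g. zero_run_counts([0.01, 0.0, 0.0, 0.0, 0.02]) gives [0, 1, 2, 3, 0]
```

Aixia's Nt for a trading day is then the lagged count: the length of the run of zeros that ended just before that day.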