E. Estevez-Rams^{1,a)}, R. Lora Serrano^{2}, B. Aragón Fernández^{3}, and I. Brito Reyes^{3}

### Abstract

Random sequences attain the highest entropy rate. The entropy rate of an ergodic source can be estimated using the Lempel-Ziv complexity measure, yet the exact entropy rate value is only reached in the infinite-length limit. We prove that, contrary to common belief, typical random sequences of finite length fall short of the maximum Lempel-Ziv complexity. We discuss that, for a finite length, maximum Lempel-Ziv sequences can be built by a well-defined generating algorithm, which makes them of low Kolmogorov-Chaitin complexity, quite the opposite of randomness. The Lempel-Ziv measure is, in this sense, less general than Kolmogorov-Chaitin complexity, as it can be fooled by an intelligent enough agent; we show this to be the case for the binary expansion of certain irrational numbers. Maximum Lempel-Ziv sequences induce a normalization that gives good estimates of the entropy rate for several sources, while keeping bounded values for all sequence lengths, making it an alternative to other normalization schemes in use.

Lempel and Ziv showed that measuring the capacity of an ergodic source to generate new patterns can be used to estimate its entropy rate. Entropy rate is a length-invariant measure of the amount of new information gained per unit step in a dynamical process. It has been found ubiquitous in a number of areas, such as the study of dynamical systems, information theory (from where its definition is usually drawn), and complexity theory, and it has become one of the fundamental concepts in data analysis. The Lempel-Ziv complexity (LZ76) method of estimating the entropy rate from a single discrete series of measurements carries a number of practical advantages. It is generally assumed that random sequences, being of maximum entropy rate, attain maximum Lempel-Ziv complexity for sequences of finite length. We prove that, contrary to this common belief, finite random sequences do not attain maximum Lempel-Ziv complexity. The consequence is that normalizing the LZ76 complexity by that of a random sequence does not yield a properly bounded value, and entropy rates estimated in this way can reach inconsistent values above one. Instead, there is a perfectly deterministic, and therefore non-random, procedure for generating maximum Lempel-Ziv sequences (MLZs) that can be used for normalization purposes, yielding sound estimates of the entropy rate.
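As a hedged illustration of the parsing behind the LZ76 measure, the sketch below implements the standard exhaustive-history parsing of Lempel and Ziv together with the commonly used estimator c(N) log₂N / N; function names are ours, and this is not necessarily the authors' exact implementation.

```python
import math

def lz76(s):
    """Number of phrases in the LZ76 exhaustive-history parsing of s.

    The current phrase is extended while it still occurs as a substring
    starting earlier in the sequence; the first non-reproducible
    extension closes the phrase.
    """
    i, c, n = 0, 0, len(s)
    while i < n:
        l = 1
        while i + l <= n and s[i:i + l] in s[:i + l - 1]:
            l += 1
        c += 1        # one new phrase closed
        i += l
    return c

def entropy_rate_estimate(s):
    """Common LZ76-based estimator: c(N) * log2(N) / N (bits/symbol)."""
    n = len(s)
    return lz76(s) * math.log2(n) / n

# Classic example from the LZ76 paper:
# 0001101001000101 parses as 0|001|10|100|1000|101, i.e. 6 phrases.
print(lz76("0001101001000101"))  # → 6
```

For a fixed-length sequence, the larger the phrase count c(N), the richer the pattern production of the source, which is why the estimator above tracks the entropy rate.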

E. Estevez-Rams thanks the Humboldt Foundation (AvH) for computational infrastructure support and the FAPEMIG (BPV-00039-12) for financial assistance in mobility.

I. INTRODUCTION

II. LEMPEL-ZIV COMPLEXITY

III. MAXIMUM LEMPEL-ZIV COMPLEXITY SEQUENCES

IV. NORMALIZATION OF LEMPEL-ZIV COMPLEXITY

V. LZ76 COMPLEXITY AND COMPLEXITY RATE ESTIMATION FOR DIFFERENT SOURCES

VI. CONCLUSIONS

### Key Topics

- Entropy (41)
- Sequence analysis (13)
- Chaos (9)
- Information and communication theory (3)
- Information theory entropy (3)

## Figures

LZ76 complexity as a function of sequence length N, for the MLZs and for 10^2 random sequences. The MLZs upper bound is clearly observed, while the simulated random sequences (rnd) lie below the MLZs values and mostly above the curve. In the inset, it can be seen that the LZ76 complexity of the random sequences can also lie below the curve.

Number of 00 patterns (#[00]) in the MLZs and random (rnd) sequences as a function of sequence length N. While #[00] for the random sequences exhibits the expected linear behavior with slope 1/4, the behavior for the MLZs departs from a linear law.

Normalized counts for all patterns of length 6 in the MLZs and random (rnd) sequences (string length N = 10^6). Patterns are ordered by their binary values. In the rnd curve, all patterns have counts near the expected value 2^{-6} (= 0.0156), while for the MLZs, counts vary from slightly below 0.0141 to slightly above 0.0168.

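The word statistics of the figure can be reproduced with a short sketch; the function name and the overlapping-window convention are our assumptions, while the word length k = 6 and the ordering by binary value follow the caption.

```python
from collections import Counter

def word_frequencies(s, k=6):
    """Normalized counts of every length-k binary word in s
    (overlapping windows), keyed by word, ordered by binary value."""
    windows = len(s) - k + 1
    counts = Counter(s[i:i + k] for i in range(windows))
    return {w: counts.get(w, 0) / windows
            for w in (format(v, f"0{k}b") for v in range(2 ** k))}

# For an i.i.d. fair-coin sequence, every frequency should be close
# to 2**-6 = 0.015625, as the rnd curve in the figure shows.
```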
LZ76 complexity as a function of the sequence length N (log-log scale) for the binary expansion of π, , and sequences, together with the MLZs and random (rnd) sequences. The binary expanded irrational numbers cannot be distinguished from the random LZ76 complexity for all lengths considered. All LZ76 complexities are below the MLZs complexity.

c(u) and for the logistic map given by Eq. (10), compared to the random (rnd) sequence. Three values for the logistic map parameter were considered: the chaotic regime r = 1.8; the intermittent point r = 1.7499; and the Feigenbaum point (fb) at r = 1.40115518. See text for details.

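Eq. (10) is not reproduced on this page; a form consistent with the quoted Feigenbaum point r ≈ 1.40115518 is the quadratic map x_{n+1} = 1 − r x_n², binarized at the critical point x = 0. The sketch below rests on that assumption (the map form, partition, transient length, and function name are ours, not the authors'):

```python
def logistic_symbolic(r, n, x0=0.3, transient=1000):
    """Binary symbolic sequence from x -> 1 - r*x**2 (x in [-1, 1]),
    using the partition at the critical point: '1' if x >= 0 else '0'."""
    x = x0
    for _ in range(transient):          # discard the transient
        x = 1.0 - r * x * x
    symbols = []
    for _ in range(n):
        x = 1.0 - r * x * x
        symbols.append("1" if x >= 0.0 else "0")
    return "".join(symbols)

# seq = logistic_symbolic(1.8, 10**5)   # chaotic regime of the figure
```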
The relative error of the c(u) and estimates ( ) of the true entropy (h) for the r = 1.8 logistic map, as a function of the sequence length N.

Finite state automaton for the nearest neighbor interaction range. S is the start state, while F and B are recurrent states. represents the probability of emitting a symbol X conditioned on being in state M.

## Tables

Estimated entropies (bit/symbol) for the logistic map (Eq. (10)). The second column is the entropy rate value from Ref. 23. The third and fifth columns are the values of c(u) and , respectively, for the 10^5-length sequence, each averaged over 50 sequences. The fourth and sixth columns correspond to the entropy rate estimated by fitting the values of c(u) (Eq. (8)) and (Eq. (9)), respectively. Values between round brackets are the relative errors with respect to the second-column values.

Estimated entropy (bit/symbol) for a -Bernoulli process. See Table I for a description of the column values.

Estimated entropy (bit/symbol) for the nearest neighbor Ising model. represents the probability of emitting a symbol X conditioned on being in state M. The rest of the columns follow the same description as Table I.
