
Abstract
This paper addresses how to calculate and interpret the time-delayed mutual information (TDMI) for a complex, diversely and sparsely measured, possibly nonstationary population of time series of unknown composition and origin. The primary vehicle used for this analysis is a comparison between the time-delayed mutual information averaged over the population and the time-delayed mutual information of an aggregated population (here, aggregation implies the population is conjoined before any statistical estimates are implemented). Through the use of information-theoretic tools, a sequence of practically implementable calculations is detailed that allows the average and aggregate time-delayed mutual information to be interpreted. Moreover, these calculations can also be used to understand the degree of homogeneity or heterogeneity present in the population. To demonstrate that the proposed methods can be used in nearly any situation, the methods are applied to the time series of glucose measurements from two different subpopulations of individuals from the Columbia University Medical Center electronic health record repository, revealing a picture of the composition of the population as well as physiological features.
In this paper, we show how to apply time-delayed mutual information (TDMI) to a sparse, irregularly measured, complicated population of time-dependent data. At a fundamental level, the technical problem is a probability density function (PDF) estimation problem; specifically, one can average PDF estimates or one can aggregate the data set before estimating the PDF. To understand and interpret these two means of coping with a population of time series, one must address four issues: (1) estimator bias; (2) normalization or distribution-support-based effects; (3) deviations from the single-source case for average and aggregate; and (4) practical interpretation. Scientifically, this paper develops an infrastructure, and demonstrates how to use it, by studying the time-dependent correlation structure in physiological variables of humans, specifically in a population of glucose time series. In the end, we not only provide a practically actionable set of information-theoretic computations that yield insight into the population composition and the time-dependent correlation structure, but we also detail the time-dependent correlation structure and the degree of homogeneity within a broad population of humans via their glucose measurements.
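The distinction between averaging and aggregating can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes evenly sampled series (so a lag is an index shift rather than a δt bin), a plain histogram (plug-in) estimator with no bias correction, and a common support for all series. The function names `mutual_information_hist`, `lagged_pairs`, and `average_and_aggregate_tdmi` are hypothetical, and the toy population (three memoryless Gaussian sources with means 0, 2, and 4) is chosen only for illustration.

```python
import numpy as np

def mutual_information_hist(x, y, bins, value_range):
    """Plug-in (histogram) mutual information, in nats, of paired samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins, range=[value_range, value_range])
    pxy = joint / joint.sum()                 # joint probability estimate
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    outer = px @ py                           # product of the marginals
    mask = pxy > 0                            # skip empty cells (0 log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / outer[mask])))

def lagged_pairs(series, lag):
    """Pairs (x_t, x_{t+lag}); assumes an evenly sampled series (an assumption)."""
    series = np.asarray(series, dtype=float)
    return series[:-lag], series[lag:]

def average_and_aggregate_tdmi(population, lag=1, bins=8):
    """Return (mean of per-series TDMI, TDMI of the pooled pairs)."""
    lo = min(float(np.min(s)) for s in population)
    hi = max(float(np.max(s)) for s in population)
    pairs = [lagged_pairs(s, lag) for s in population]
    # Average: estimate the TDMI for each series, then average the estimates.
    per_series = [mutual_information_hist(x, y, bins, (lo, hi)) for x, y in pairs]
    average = float(np.mean(per_series))
    # Aggregate: pool all lagged pairs first, then estimate once.
    x_all = np.concatenate([x for x, _ in pairs])
    y_all = np.concatenate([y for _, y in pairs])
    aggregate = mutual_information_hist(x_all, y_all, bins, (lo, hi))
    return average, aggregate

# Toy heterogeneous population: independent Gaussian noise around three means.
rng = np.random.default_rng(0)
population = [rng.normal(loc=m, scale=1.0, size=2000) for m in (0.0, 2.0, 4.0)]
average, aggregate = average_and_aggregate_tdmi(population, lag=1, bins=8)
print(f"average TDMI:   {average:.4f} nats")
print(f"aggregate TDMI: {aggregate:.4f} nats")
```

In this toy case each series has no temporal dependence, so the averaged estimate should sit near the estimator bias, whereas pooling sources with different means induces an apparent dependence between the lagged values in the aggregate estimate. A gap of this kind is one way heterogeneity in the supports can show up when the two quantities are compared.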
The authors would like to thank two anonymous reviewers, J. Dias, N. Elhadad, A. Perotte, and D. Varn for carefully reading this paper and providing many useful comments. D.J.A. would like to thank C. Shalizi for early discussions related to this work. Finally, the authors would like to acknowledge the financial support provided by NLM Grant No. RO1 LM06910.
I. INTRODUCTION
A. A reader’s guide: The outline of this paper
II. MOTIVATING EXAMPLES
III. INFORMATION THEORY BACKGROUND
A. Average TDMI
B. Aggregate TDMI
IV. TDMI-SPECIFIC ESTIMATOR BIASES
A. Sample size dependent estimator bias effects
B. Fixed point bias estimate for average and aggregate populations
C. Non-estimator bias: How the TDMI calculation can act as a population filter
1. Methods for assessing δt bin compositions
V. POPULATION-BASED DEVIATIONS FROM THE INDIVIDUAL TDMI ESTIMATES
A. Heterogeneity-based deviations from the individual: Average TDMI case
1. Entropy of the averaged population
B. Heterogeneity-based deviations from the individual: Aggregate TDMI case
1. Entropy of the aggregated population
VI. HOW TO INTERPRET THE TDMI FOR A POPULATION, OR, TDMI-BASED METHODS FOR INTERPRETING POPULATION DIVERSITY
A. Support dependent, graph independent, effects on the population TDMI
B. Graph dependent, support independent, effects on the population TDMI
C. Support dependent, graph-based effects on the population TDMI
VII. NON-TDMI-BASED METHODS FOR INTERPRETING POPULATION DIVERSITY
A. Homogeneity in measurement composition
B. Homogeneity in measurement distribution supports
C. Homogeneity in the distribution of the graphs of the measurement PDFs
VIII. ASSEMBLING THE PIECES: AN EXPLICIT PRESCRIPTION FOR TDMI ANALYSIS AND INTERPRETATION FOR A POPULATION OF TIME SERIES FOR A FIXED TIME SEPARATION δt
A. Step one: Determining the computability of
B. Step two (A in Fig. 2): Interpreting δI(δt) or
C. Step three (B in Fig. 2): Assessing population representation
IX. QUANTITATIVE EXAMPLES FOR TDMI INTERPRETATION AND POPULATION HOMOGENEITY EVALUATION
A. Simulated data examples: The quadratic map and the Gauss map
1. TDMI-based analysis of the simulated data
2. Non-TDMI-based analysis of the simulated data
3. Quantifying small sample-size effects
B. Real data examples: Glucose values for 100 densely sampled individuals versus 20,000 random individuals
1. TDMI-based analysis for data set 7, the well measured population
2. Non-TDMI-based analysis for data set 7, the well measured population
3. TDMI-based analysis for data set 8, the random (less well measured) population
4. Non-TDMI-based analysis for data set 8, the random (less well measured) population
5. Analysis of the TDMI under variation of δt
X. DISCUSSION AND COMMENTS
A. Specific results of the interpretative framework relative to real data
B. Using categorical billing code data to help verify the TDMI analysis
C. How our method addresses nonstationarity
D. Comments regarding the connection between the supports and the normalizations of the distributions
E. Future directions regarding the use of this technique
F. Some remaining statistical problems
XI. SUMMARY
Key Topics
 Time series analysis
 45.0
 Data sets
 31.0
 Data analysis
 27.0
 Entropy
 17.0
 Aggregation
 15.0
Figures
(Color) Graphically comparing the average PDF and the PDF of the aggregate for three collections of Gaussian random numbers whose distributions have means 0, 2, and 4, respectively.
The graphical schematic for the TDMI analysis of a population; note that by TDMI Present, we mean that the relevant TDMI measure (e.g., ) is greater than bias.
(Color) The graphs of the quadratic map (Eq. (50)) and the Gauss map (Eq. (51)); note the significant difference between the graphs of the mappings. Also shown are the invariant densities (PDF of the orbit) for the quadratic map, the Gauss map, and the sum of the quadratic and Gauss maps; note the significant differences between the relative p’s.
(Color) PDFs of glucose measurements for individuals within a population and for a population for two data sets, the 100 patients with the largest records and 20 000 random patients.
(Color) Comparisons of the supports and PDF graph variations for two data sets, the 100 patients with the largest records and 5000 random patients.
(Color online) The TDMI for both and with δt bins of 6 h for a period of a few days for D_7 and D_8; note that the bias estimates can be found in Tables VIII and VII. With respect to (a), note the following: for δt ≤ 6 h, δI > 0 and for δt > 6 h, δI ≈ 0; the KDE and histogram estimates are extremely similar; the diurnal (daily) periodic variation in correlation of glucose is clearly evident in both and . With respect to (b), note the following: for all δt, δI is consistent and likely zero within bias; the KDE and histogram estimates differ greatly, implying the presence of small sample-size effects in the average TDMI calculation; the diurnal (daily) periodic variation in correlation of glucose is clearly evident in both and in all but the KDE-estimated TDMI average.
Tables
How to interpret the TDMI for a population of time series
Summary of all the non-TDMI-based metrics used to assess homogeneity in a population (both among the graphs and the supports) and to verify the TDMI-type analysis.
Summary of all the TDMI-based metrics used to interpret the TDMI and determine the population composition.
Complete list of the simulated data sets.
TDMI results and homogeneity metrics for the simulated data sets one through six.
Heuristic homogeneity metrics for the simulated data sets one through six.
Complete list of the real patient data sets.
TDMI results and homogeneity metrics for the real patient data sets seven and eight; note all δt times are in hours.
TDMI results and homogeneity metrics for the real patient data sets seven and eight; note all δt times are in hours.
Time independent TDMI results for the real patient data sets seven and eight.
Heuristic homogeneity metrics for the real patient data sets seven and eight.