A model for CDRs data: motivation, goal, idea

The following project aims to define a model for the Call Detail Records (CDRs) data generated by the Telecom Italia cellular network.

The detection of basic trends and associations on customer usage needs a cost-effective mechanism that normalize the data according to relevant space and time dependences of the service consumption. Moreover, the creation of a data model that can provide richer information used to make better decisions is expected to be of easy implementation, without the needs of changes or additional response time with data growth.

The average dynamics of user activity during a typical working day depends mostly on spatial factors (as the population density) and on temporal factors (awakening and lunchtime for example). Their contributions on the user activity recorded on CDRs data are superimposed and variable during the day and over the space. The goal of the proposal is to find a way to uncouple the aforementioned factors, so to define a model able to answer at questions like the following: How does the spatial factor change on time? and How does the temporal factor change over space?

The idea is to formulate a oversimplified model with a spatial factor constant on time and a temporal factor constant over space. Show that the model is a good candidate to describe the user activity dynamics during the typical working day, then describe the difference between the simplified model and the CDRs data adding temporal dependence to the spatial factor or spatial dependence to the temporal factor

The data

Received SMS Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city of Milano. The CDRs considered are the following: 100*100 CDRs (the Milano grid) every 10 minutes (144 intervals a day) every working-day between November 1st 2013 to December 23th 2013 (37 days). In total: 100*100*144*37 = 53280000 CDRs. Of every CDR, only its value, the cell and the time have been considered.

The model

Let \(c(i,t)\) the CDRs data associated to cell \(i\) at time \(t\). Then we define \(C(i,t)\) as the mean of the \(c(i,t)\) values calculated over the 37 working days considered. In other words $$ C(i,t) = \frac{1}{n+1}\sum_r c(i,r) $$ where \(r = t, t + 144, t + 2 \times 144, ... , t + n \times 144\) and \(n+1\) are the number of working days taken into account (144 are the number of 10 minutes intervals in a day).

The observables \(C(i,t)\) describe the average dynamics of user activity during a typical working day and they depends mostly on spatial factors (as the population density) and on temporal factors (awakening and lunchtime for example). Then we formulate a simplified model with a spatial factor constant on time and a temporal factor constant over space, in formula we write \(C(i,t)\) as $$ C(i,t) = A(t) D(i) $$ where \(A(t)\) and \(D(i)\) are the temporal and spatial factors aforementioned. Note that \(A(t)\) does not depend on the spatial variable \(i\) and \(D(i)\) does not depend on the temporal variable \(t\).

Consistence with the data

We show that the model is a good candidate to describe the user activity dynamics during the typical working day. First note that $$ \sum_k C(k,t) = A(t) \sum_k D(k) \propto A(t) \\ \sum_r C(i,r) = D(i) \sum_r A(r) \propto D(i) $$ Then we have $$ C(i,t) \propto \sum_k C(k,t) \sum_r C(i,r) $$

Let define the random variable \(V(i,t;j,s)\) as $$ V(i,t;j,s) = \frac{C(i,t)}{C(j,s)} - \frac{\sum_k C(k,t) \sum_r C(i,r)}{\sum_k C(k,s) \sum_r C(j,r)} $$

The above variable \(V(i,t;j,s)\) could be easily calculated from the CDRs data and its distribution is expected to be narrow and centered around zero. Fig. 1. shows that this is actual the case.

Fig. 1
Fig. 1: The distribution of the random variable \(V(i,t;j,s)\) is tight and around zero.

Comparison with random data

CDRs data are real numbers greater than zero. The following figure shows the distribution of \(V(i,t;j,s)\) calculated after the substitution of CDRs data with random numbers in the interval \([0,1]\), note that the distribution is different from Fig. 1.

Fig. 1
The distribution of \(V(i,t;j,s)\) with \(C(i,t)\) random numbers in \([0,1]\).

Enhance the model

Add temporal dependence to the spatial factor

The model becomes $$ C(i,t) = A(t) D(i,t) $$ The spatial factor \(D(i,t)\) can be written in terms of the CDRs data by means of the following reasoning. First, eliminate the factor \(A(t)\) by integration over \(i\) $$ \frac{C(i,t)}{\sum_k C(k,t)} = \frac{D(i,t)}{\sum_k D(k,t)} $$ then, under the assumption that the quantity \(\sum_k D(k,t)\) does not depend on time \(t\), we can write $$ D(i,t) \propto \frac{C(i,t)}{\sum_k C(k,t)} $$ Th above assumption generally holds under restrictive border conditions, in our case we are assuming that the Milano grid is wide enough that the dynamics inside the grid could be considered independent from the dynamics outside the grid.

Add spatial dependence to the temporal factor

Following the same reasoning as above, with the only difference that the enhanced model is now \(C(i,t) = A(i,t) D(i)\), by integration over the temporal variable \(t\) and under the assumption that \(\sum_r A(i,r)\) does not depend on \(i\), we have $$ A(i,t) \propto \frac{C(i,t)}{\sum_r C(i,r)} $$

Results

The model \(C(i,t) = A(t) D(i)\) well describes the overall dynamics emerging from CDRs data and it could be used as starting point for further enhancements. In particular, the factors \(D(i)\) and \(A(t)\) are expected to vary on time and space respectively and such dependence can be dealt with separately by means of CDRs data only. Figures 2 and 3 show the distributions of the variables \(A(i,t)\) and \(D(i,t)\).

Fig. 2
Fig. 2: The distribution of the random variable \(A(i,t)\). The two peaks are a conseguence of the daytime/nightime cicle.
Fig. 3
Fig. 3: The distribution of the variable \(D(i,t)\).

Spatial patterns on variable \(D(i,t)\) are expected, much more interesting are temporal patterns emerging during the whole working day. Figures 4 to 7 show the cells with \(D(i,t) > w\), where \(w\) is a cutoff value, at different times, note that the spatial distributions are different.

Fig. 4
Fig. 4: Snapshot of the Milano grid taken at time 6:40. Cells with \(0.0004 > D(i,t) > 0.0006\) are colored.
Fig. 5
Fig. 5: Snapshot of the Milano grid taken at time 10:00. Cells with \(0.0004 > D(i,t) > 0.0006\) are colored.
Fig. 6
Fig. 6: Snapshot of the Milano grid taken at time 13:20. Cells with \(0.0004 > D(i,t) > 0.0006\) are colored.
Fig. 7
Fig. 7: Snapshot of the Milano grid taken at time 21:40. Cells with \(0.0004 > D(i,t) > 0.0006\) are colored.

Vice versa, temporal patterns on variable \(A(i,t)\) are expected, whereas spatial patterns on the Milano grid are not trivial. Figures 8 to 10 show the cells with \(p > A(i,t) > q\) at different times, note the presence of spatial clusters.

Fig. 8
Fig. 8: Snapshot of the Milano grid taken at time 10:00. Cells with \(0.0145 > A(i,t) > 0.0155\) are colored.
Fig. 9
Fig. 9: Snapshot of the Milano grid taken at time 13:20. Cells with \(0.0145 > A(i,t) > 0.0155\) are colored.
Fig. 10
Fig. 10: Snapshot of the Milano grid taken at time 18:20. Cells with \(0.0145 > A(i,t) > 0.0155\) are colored.

Inpact and future evolution

The proposed normalization mechanism is at the base of further analysis targeted to find patterns of the user behaviour throughout the working-day, and helpful in accurate operational planning like network dynamic congestion control, cell-site optimization and intelligent network planning. The ability to create "what if" scenarios based on past trends helps to anticipate and implement necessary network change just ahead of demand curve, prioritize and optimize network investment plan based on service forecast demands.

The model is based on CDRs data, i.e. on data that describes the user activity. Then the model could be improved by the availability of data regarding the user inactivity, in other words the knowledge of the ratio between active users and total users permits a significant enhancement on the interpretation of the model. At the present state indeed the model does not give indication of which of the alternative variables, \(A(i,t)\) or \(D(i,t)\), better express the user behaviour at the given space and time coordinates.

Moreover, in the present analysis it remains to identify geographical regions of interest according to the space-time dynamics of variables \(A(i,t)\) and \(D(i,t)\).