Tuesday, May 2, 2017

An R named list is not a Python dictionary

In the previous post, mpiktas posted a comment that included this question: "Isn't Python dictionary a named list in R?". The short answer is no, but the reason why is worth some more explanation.

At first sight, a named list does look a little like a Python dictionary: it allows you to retrieve values from a list using names (i.e. strings) instead of numerical indices. The code below is also a good example of how to mix R and Python code inside a Jupyter notebook.

In [1]:
%load_ext rpy2.ipython
In [2]:
%R RL <- list('one'=1, 'two'=2, 'three'=3) ; RL[['one']] 
Out[2]:
array([ 1.])
In [3]:
PD = {'one': 1, 'two':2, 'three': 3} ; PD['one']
Out[3]:
1

But a dictionary in Python can do more: the keys can be of any hashable type (any built-in immutable type, for instance), so you can use numbers, strings, and tuples as keys, even within a single dictionary.

In [4]:
PD = {'one': 1, 2:2, (3,'three') :  3 } ; PD[(3,'three')]
Out[4]:
3

And you can add a new entry by key to an existing dictionary.

In [5]:
PD['four'] = 4 ; PD
Out[5]:
{'one': 1, 2: 2, (3, 'three'): 3, 'four': 4}

In R, a two-step approach works: append the value, then set its name (a single-step alternative is shown right after the example).

In [6]:
%R RL[4] <- 4 ; names(RL)[4] <- 'four' ; names(RL)
Out[6]:
array(['one', 'two', 'three', 'four'], 
      dtype='<U5')
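
The single-step alternative is to assign directly to a new name with [[ or $ (a small sketch, easy to verify in any R session):

RL <- list(one = 1, two = 2, three = 3)
RL[["four"]] <- 4    # creates a new element named "four" in one step
RL$five <- 5         # same idea with $
names(RL)            # "one" "two" "three" "four" "five"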

But there is a more fundamental difference: a Python dictionary is implemented as a hash table, so access by key is essentially independent of the size of the dictionary, while access by name in an R named list behaves as if the vector of names were scanned linearly. This makes access by name in R (very) slow compared to access by index. Here is an example that also shows how to mix R and Python sensibly.

The code builds (large) equivalent structures; a random subset of indices is defined in R (easier there) and passed to the Python code. We then measure the access time with different approaches. In Python, we use both a list and a dictionary.

In [7]:
import time
import pandas
nBits = [18,19,20,21]
columns = ["Size", "Python dict by key", "Python list by index", "R list by index", "R list by key"]
results = pandas.DataFrame(columns=columns, index=nBits)
nSamples = 1000
for nBit in nBits:
    n = (1<<nBit)
    results["Size"][nBit] = n
    %R -i nSamples -o samples -i n   samples <- sample(1:(n-1), nSamples, replace = FALSE) 
    %R -o keys                       keys <- as.character(samples)
    keys = list(keys)
    PL = list(range(n))
    start = time.perf_counter()
    S = sum([PL[i] for i in samples ])
    stop = time.perf_counter()
    results["Python list by index"][nBit] = (stop - start) / nSamples * 1e6
    PD = { str(x):x for x in PL}
    start = time.perf_counter()
    S = sum([PD[k] for k in keys])
    stop = time.perf_counter()
    results["Python dict by key"][nBit] = (stop - start) / nSamples * 1e6
    %R -i n -o RL RL <- as.list(1:n)
    %R            start <- Sys.time() ; S <- 0 ; for (i in samples) { S <- S + RL[[i]] } ; stop <- Sys.time()
    %R -o idelay  idelay <- stop - start
    %R            names(RL) <- as.character(RL)
    %R            start <- Sys.time() ; S <- 0 ; for (k in keys ) { S <- S + RL[[k]] } ; stop <- Sys.time()
    %R -o kdelay  kdelay <- stop - start  
    results["R list by index"][nBit] = idelay[0] / nSamples * 1e6
    results["R list by key"][nBit] = kdelay[0] / nSamples * 1e6
results
Out[7]:
(times are in microseconds per access; the row index is the number of bits in the list size)

      Size     Python dict by key  Python list by index  R list by index  R list by key
18    262144   0.447617            0.257533              0.998974         1298
19    524288   0.659162            0.341844              2.00105          3028.03
20    1048576  0.758547            0.357429              8.03399          6471.99
21    2097152  0.857677            0.415936              1.02687          13096

The results show the occasional outlier (the last "R list by index" value, for instance), but the main conclusion is clear: access by key in R is both slow in absolute terms and growing with the size of the list.
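
A hedged intuition for that growth (an illustration of the observed behaviour, not a claim about the exact C implementation): looking up a list element by name behaves much like a search over the names vector, along these lines:

# illustration only: access by name is conceptually a scan over names(RL)
RL <- as.list(1:6)
names(RL) <- as.character(RL)
k <- "5"
i <- match(k, names(RL))   # linear search over a character vector
RL[[i]]                    # same result as RL[[k]]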

There is more to say; in particular, R does have data structures based on hash tables (environments, notably), but they remain less flexible than a Python dictionary. I am not the first person to make this kind of comparison, and similar analyses have been published elsewhere.
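
As a small sketch of that hash-based option: an environment created with new.env(hash = TRUE) gives dictionary-like lookup, but only with character keys.

# sketch: using an environment as a hash map (keys must be strings)
E <- new.env(hash = TRUE)
E[["one"]] <- 1
assign("two", 2, envir = E)
get("one", envir = E)                        # 1
exists("three", envir = E, inherits = FALSE) # FALSE
ls(E)                                        # "one" "two"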

Sunday, March 5, 2017

Python and R for code development

The previous post touched briefly on why I now prefer Python for writing code, including for a module like logopt. This post explains in more detail some specific differences where I prefer one of these two languages:
  • 0-based indexing in Python versus 1-based indexing in R.  This may seem a small difference, but for me 0-based indexing is more natural and results in fewer off-by-one errors.  No less an authority than Dijkstra argues for 0-based indexing.
  • = versus <- for assignment.  I like the R approach here, and I would like to see more languages do the same.  I still sometimes end up using = where I wanted ==.  If only R allowed <- in call arguments.
  • CRAN versus pypi
    • CRAN is much better for the user, the CRAN Task Views are a gold mine, and in general CRAN is a better repository, with higher quality packages.
    • But publishing on CRAN is simply daunting, and that is the reason logopt remained on R-Forge only.  The manual explaining how to write extensions is 178 pages long.
  • Python has better data structures; the dictionary in particular is something I miss whenever I write R code.  Python has no native dataframe, but this is easily taken care of by importing pandas.
  • Object orientation is conceptually clean and almost easy to use in Python, less so in R.
  • Plotting is better in R.  There are some efforts to make Python better in that area, especially for ease of use.  Matplotlib is powerful but difficult to master.
  • lm is a gem in R; the simplicity with which the formula interface lets you express the model you want to fit is incredible (see the short example after this list).
All in all, I prefer coding in Python.  This is a personal opinion of course, and R remains important because of some packages, but for more general purpose tasks, Python is simpler to use, and that translates into being more productive.
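
As a small illustration of that last point about lm (a toy example with made-up data, unrelated to the rest of the post):

# toy example: the formula interface makes model specification very compact
set.seed(1)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 1 + 2 * d$x - 0.5 * d$z + rnorm(100, sd = 0.1)
coef(lm(y ~ x + z, data = d))       # additive model
coef(lm(y ~ x * z, data = d))       # main effects plus their interaction
coef(lm(y ~ poly(x, 2), data = d))  # quadratic in x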

Monday, February 20, 2017

Rebooting with Python and Jupyter

This blog has been inactive for a long time for essentially two reasons:

  • I was not very happy with the quality of the results
    • The source code was not showing very nicely
    • It was difficult to get a nice display, including for pictures and mathematical expressions
  • I started to use Python almost exclusively
    • R is a nice language, but it is not a general-purpose language; some tasks are hard in R compared to Python
    • On the other hand, Python has steadily improved in the area of data processing, with pandas providing something equivalent to the R dataframe
But now there is a good way to solve both problems: Jupyter notebooks combined with the ability to directly include HTML in a post.  So it is time for a reboot.

Jupyter was originally known as IPython but has evolved to support many programming languages, including R.  It is now possible to develop a notebook, possibly based on multiple languages, then convert it for posting, while keeping the original notebook available for readers who want a more interactive experience.  The development process is much simpler that way than it used to be for earlier posts.

As an example, the rest of this post is this notebook converted to HTML.  Note that the notebook contains both R and Python code interacting in an almost seamless way.  How to achieve that result will be explained in later posts.

In [1]:
%load_ext rpy2.ipython
In [2]:
%%R -o x -o xik -o n -o pik
# figure 8.1 of Cover "Universal Portfolios"

library(logopt)
data(nyse.cover.1962.1984)
n <- nyse.cover.1962.1984
x <- coredata(nyse.cover.1962.1984)
xik <- x[,c("iroqu","kinar")]
nDays <- dim(xik)[1]
Days <- 1:nDays
pik <- apply(xik,2,cumprod)
plot(Days, pik[,"iroqu"], col="blue", type="l", 
     ylim=range(pik), main = '"iroqu" and "kinar"', ylab="")
lines(Days, pik[,"kinar"], col="red")
grid()
legend("topright",c('"iroqu"','"kinar"'),
       col=c("blue","red"),lty=c(1,1))
In [3]:
print(x)
print(type(x))
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.ion()
plt.figure(figsize=(6,4))
plt.plot(pik)
plt.grid()
[[ 1.01515  1.02765  1.04183 ...,  1.00578  0.99697  0.99752]
 [ 1.01493  1.04036  0.98905 ...,  1.00958  0.99088  1.00248]
 [ 1.       0.97629  0.97786 ...,  1.       1.02761  0.99752]
 ..., 
 [ 0.99029  0.9966   0.99605 ...,  0.99216  1.00461  0.99273]
 [ 0.99265  1.00683  1.      ...,  0.99209  1.02752  1.00366]
 [ 0.99753  1.00339  1.01984 ...,  1.01195  1.       0.99635]]
<class 'numpy.ndarray'>

Thursday, November 22, 2012

Escaping the simplex, part 1

Before tackling the main subject, two quick notes:

  • I did not post for quite a while, in part because I followed the Coursera online course Introduction to Computational Finance and Financial Econometrics.  It was a nice refresher, extremely well presented, and it included some R.  It did consume enough time to make posting cumbersome, though.
  • I started using RStudio, and I am quite happy with it under Windows.  This has also impacted my code, as RStudio automatically stacks your plots.
The main topic of this post is to motivate an escape from the simplex, i.e. to allow weights either larger than 1 or smaller than 0.  Generally, this means going short on an asset deemed less favorable and shifting the short proceeds to an asset deemed more favorable.

The main reason to do so is that the Best Constant Rebalanced Portfolio (BCRP) may lie outside the simplex.  This is a well-known feature of the more traditional mean-variance analysis, where the constrained portfolio is sparser than the unconstrained one.  We see a similar effect for the BCRP: the constrained optimization results in a sparse portfolio, i.e. many weights are (close to) zero.
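
To make this precise (the notation here is my own shorthand for Cover's setting): writing $x_t$ for the vector of price relatives on day $t$, the wealth of a constant rebalanced portfolio $b$ after $n$ days is

$$S_n(b) = \prod_{t=1}^{n} b^\top x_t,$$

and the BCRP is $b^\star = \arg\max_b S_n(b)$ subject to the simplex constraints $\sum_i b_i = 1$ and $b_i \ge 0$.  Escaping the simplex keeps the budget constraint $\sum_i b_i = 1$ but drops the nonnegativity constraints, allowing short ($b_i < 0$) and leveraged ($b_i > 1$) positions.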

We can illustrate this by calculating the weights of the constrained BCRP.  The BCRP function in logopt currently only supports optimization on the simplex, and we can reuse the code originally presented in Universal portfolio, part 10 to accumulate the BCRP weights across many combinations of the reference assets.  This shows that the proportion of (close to) zero weights is high.

The code below, as is, runs for about 7 hours because of the large number of combinations; you can reduce the size of the problem by editing TupleSizes.

Note that Syntax Highlighter has had problems with R-bloggers in the past; you might need to go to the original page to view the code.

# assume the use of RStudio (no explicit management of graphic devices)

library(logopt)
data(nyse.cover.1962.1984)
x <- coredata(nyse.cover.1962.1984)
nStocks <- dim(x)[2] 

EvaluateOnAllTuples <- function(ListName, TupleSizes, fFinalWealth, ...) {
  if (exists(ListName) == FALSE) {
    LocalList <- list()
    for (i in 1:length(TupleSizes)) {
      TupleSize <- TupleSizes[i]
      ws <- combn(x=(1:nStocks), m=(TupleSize), FUN=fFinalWealth, simplify=TRUE, ...)
      LocalList[[i]] <- ws
    }
    assign(ListName, LocalList, pos=parent.frame())
  }
}

TupleSizes <- c(2,3,4,5)

# evaluate the sorted coefficients for best CRP
SortedOptB <- function(cols, ...) {
  x <- list(...)[[1]] ;
  x <- x[,cols]
  b <- bcrp.optim(x)
  return(sort(b))
}

TupleSizes <- c(2,3,4,5)
EvaluateOnAllTuples("lSortedOptBSmall", TupleSizes, SortedOptB, x)

TupleSizes <- c(nStocks-3,nStocks-2,nStocks-1,nStocks)
EvaluateOnAllTuples("lSortedOptBLarge", TupleSizes, SortedOptB, x)

Colors <- c("red", "green", "blue", "brown", "cyan", "darkred","darkgreen")

for (iL in 1:length(lSortedOptBSmall)) {
  nCoeff <- nrow(lSortedOptBSmall[[iL]])
  E <- ecdf(lSortedOptBSmall[[iL]][1,])
  Title <- sprintf("Cumulative PDF for all sorted weights for BCRP of %d assets", nCoeff)
  plot(E, col=Colors[1], xlim=c(0,1), pch=".", main = Title, xlab = "relative weight")
  cat(sprintf("Combinations of %d assets\n", nCoeff))
  for (iB in 1:nCoeff) {
    E <- ecdf(lSortedOptBSmall[[iL]][iB,])
    lines(E, col=Colors[iB], pch=".")
    if (iB < nCoeff) {
       cat(sprintf("Percent of portfolio with %d weight(s) smaller than 0.001: %f\n", iB, E(0.001) ))
    }
  }
}

SmallOf2 <- lSortedOptBSmall[[1]][1,]
hist(SmallOf2,n=25,probability=TRUE, main="Histogram of smallest weight for two assets", 
     xlab="Smallest coefficient")
lines(density(SmallOf2, bw=0.02),col="blue")

# for the large ones, we show only the largest coefficients and show them in 
# opposite order to find how many coefficients are not always insignificant
for (iL in 1:length(lSortedOptBLarge)) {
  nCoeff <- nrow(lSortedOptBLarge[[iL]])
  E <- ecdf(lSortedOptBLarge[[iL]][nCoeff,])
  Title <- sprintf("Cumulative PDF for largest sorted weights for BCRP of %d assets", nCoeff)
  plot(E, col=Colors[1], xlim=c(0,1), pch=".", main = Title, xlab = "relative weight")
  cat(sprintf("Combinations of %d assets\n", nCoeff))
  for (iB in 1:nCoeff) {
    iCoeff <- nCoeff-iB+1
    E <- ecdf(lSortedOptBLarge[[iL]][iCoeff,])
    if (E(0.001) < 0.9999) {
      lines(E, col=Colors[iB], pch=".")
      cat(sprintf("Percent of portfolio with %d weight(s) smaller than 0.001: %f\n", iCoeff, E(0.001) ))
    }
  }
}

Only a subset of the results is shown below, starting with the empirical cumulative distribution of the sorted weights for all combinations of 5 assets.  The graph shows that the smallest coefficient is almost always 0 and the second-smallest is also very small most of the time.  The textual output shows that the two smallest coefficients are insignificant 91% of the time, or equivalently that 91% of the best portfolios use only 3 of the 5 possible assets.


Combinations of 5 assets
Percent of portfolio with 1 weight(s) smaller than 0.001: 0.996499
Percent of portfolio with 2 weight(s) smaller than 0.001: 0.914086
Percent of portfolio with 3 weight(s) smaller than 0.001: 0.499448
Percent of portfolio with 4 weight(s) smaller than 0.001: 0.067975



The density for the two-asset case clearly shows a high peak at zero.


Finally, the textual output and the ECDF show that with 33 possible assets, only up to 7 are ever present in the BCRP.


Combinations of 33 assets
Percent of portfolio with 33 weight(s) smaller than 0.001: 0.000000
Percent of portfolio with 32 weight(s) smaller than 0.001: 0.000000
Percent of portfolio with 31 weight(s) smaller than 0.001: 0.000000
Percent of portfolio with 30 weight(s) smaller than 0.001: 0.000000
Percent of portfolio with 29 weight(s) smaller than 0.001: 0.065126
Percent of portfolio with 28 weight(s) smaller than 0.001: 0.707843
Percent of portfolio with 27 weight(s) smaller than 0.001: 0.947059


All this points to the fact that most of the weights of the BCRP end up on the boundary of the simplex, and that removing that specific constraint would give an even better solution, at least in terms of terminal wealth.  We'll investigate this further in future posts.

Sunday, September 23, 2012

Universal portfolio, part 11

First, an apology: the links to the Universal Portfolio paper have stopped working.  This is because the personal web page of Thomas Cover at Stanford has been taken down, but fortunately the content has moved elsewhere.  The new link is Universal Portfolio, and hopefully this one will be stable.

Note that there are many copies available on the web, but most (like this one) are for what seems to be a slightly reworked version dated October 23, 1996.  The text appears mostly identical to the published version, but it does not include the figures.

In the rest of this post, I discuss the data used by Cover.  That data is included in logopt as nyse.cover.1962.1984.  It contains the daily price relatives (the ratio of consecutive closing prices) for 36 NYSE stocks between 1962 and 1984.


> range(index(nyse.cover.1962.1984))
[1] "1962-07-03" "1984-12-31"
> colnames(nyse.cover.1962.1984)
 [1] "ahp"    "alcoa"  "amerb"  "arco"   "coke"   "comme"  "dow"    "dupont" "espey"  "exxon"  "fisch"  "ford"   "ge"     "gm"     "gte"    "gulf"   "hp"    
[18] "ibm"    "inger"  "iroqu"  "jnj"    "kimbc"  "kinar"  "kodak"  "luken"  "meico"  "merck"  "mmm"    "mobil"  "morris" "pandg"  "pills"  "schlum" "sears" 
[35] "sherw"  "tex" 

The names are not tickers; some guessing, with some help from another person using the series, gives the table below (and if anybody knows about the ones without an expansion yet, please post a comment).

Abbreviation  Company name                            Current ticker
ahp           ?                                       ?
alcoa         Alcoa                                   AA
amerb         American Brands (later Fortune Brands)  -
arco          ?                                       ?
coke          Coca-Cola                               KO
comme         Commercial Metals                       CMC
dow           Dow Chemical                            DOW
dupont        DuPont                                  DD
espey         Espey Manufacturing                     ESP
exxon         Exxon Mobil                             XOM
fisch         Fischbach Corp                          -
ford          Ford                                    F
ge            General Electric                        GE
gm            General Motors                          GM*
gte           GTE Corporation                         -
gulf          Gulf Oil (now Chevron)                  CVX
hp            Hewlett-Packard                         HPQ
ibm           IBM                                     IBM
inger         Ingersoll-Rand                          IR
iroqu         Iroquois Brands                         -
jnj           Johnson & Johnson                       JNJ
kimbc         Kimberly-Clark                          KMB
kinar         Kinark?                                 -
kodak         Eastman Kodak                           EKDKQ
luken         Lukens?                                 -
meico         ?                                       ?
merck         Merck                                   MRK
mmm           3M                                      MMM
mobil         Exxon Mobil                             XOM
morris        Philip Morris                           PM
pandg         Procter & Gamble                        PG
pills         Pillsbury (now part of General Mills)   -
schlum        Schlumberger                            SLB
sears         Sears Holdings                          SHLD
sherw         Sherwin-Williams                        SHW
tex           Texaco (now Chevron)                    CVX

There is a lot of diversity across the different stocks; we show that in two ways:

  • by showing the global time evolution of all stocks in time
  • by showing the growth rate at two times separated by N market days (shown as a price relative between the two dates).

# Some statistics on the NYSE series

library(logopt)
x <- coredata(nyse.cover.1962.1984)
w <- logopt:::x2w(x)
nDays <- dim(x)[1]
nStocks <- dim(x)[2] 
Days <- 1:nDays
iWin <- 1 ; plot(1:10)
Time <- index(nyse.cover.1962.1984)

# for each stock calculate:
# - min, max
# - average geometric return

MaxFinal <- max(w[nDays,])
MinFinal <- min(w[nDays,])
MaxAll <- max(w)
MinAll <- min(w)

if(length(dev.list()) < iWin) { x11() } ; iWin <- iWin + 1 ; dev.set(iWin) ;
plot(Time, w[,1], col="gray", ylim=range(w), log="y", type="l")
for (i in 1:nStocks) {
 lines(Time, w[,i], col="gray")
 if (w[nDays,i] == MaxFinal) { cat(sprintf("Stock with best final value: %s finishing at %.2f\n", colnames(w)[i], MaxFinal)) ; iMax <- i }
 if (w[nDays,i] == MinFinal) { cat(sprintf("Stock with worst final value: %s finishing at %.2f\n", colnames(w)[i], MinFinal)) ; iMin <- i }
 if (max(w[,i]) == MaxAll) { cat(sprintf("Stock with best peak value: %s at %.2f\n", colnames(w)[i], MaxAll)) }
 if (min(w[,i]) == MinAll) { cat(sprintf("Stock with worst valley value: %s at %.2f\n", colnames(w)[i], MinAll)) } 
}

lines(Time, w[,iMax], col="green")
lines(Time, w[,iMin], col="red")
lines(Time, apply(w,1,mean), col="blue")
grid()

# do a summary across n quotes
nDelta <- 1200
wD <- w[(nDelta+1):nDays,] / w[1:(nDays-nDelta),]
Time <- Time[1:(nDays-nDelta)]
MaxDAll <- max(wD)
MinDAll <- min(wD)
if(length(dev.list()) < iWin) { x11() } ; iWin <- iWin + 1 ; dev.set(iWin) ;
plot(Time, wD[,1], col="gray", ylim=range(wD), log ="y", type="l")
for (i in 1:nStocks) {
 lines(Time, wD[,i], col="gray")
 if (max(wD[,i]) == MaxDAll) { cat(sprintf("Stock with best gain on %s days: %s at %.2f\n", nDelta, colnames(w)[i], MaxDAll)) }
 if (min(wD[,i]) == MinDAll) { cat(sprintf("Stock with worst loss on %s days: %s at %.2f\n", nDelta, colnames(w)[i], MinDAll)) } 
} 
lines(Time, apply(wD,1,mean), col="blue")

grid()

This gives the following textual output and graphs.  Note that there are many alternative ways to present this information, in particular with the package PerformanceAnalytics.


Stock with worst final value: dupont finishing at 3.07
Stock with worst valley value: meico at 0.26
Stock with best final value: morris finishing at 54.14
Stock with best peak value: schlum at 90.12
Stock with best gain on 1200 days: espey at 15.84
Stock with worst loss on 1200 days: meico at 0.07



This sequence forms a nice reference covering a long period of time, and has been used in many studies of portfolio selection algorithms.  But the series has a number of serious problems:
  • Survivorship bias
  • The time range corresponds to a period when quotes were not yet decimal (prices moved in fractions of a dollar).