Significantly improve performance by avoiding broadcasting for very small arrays, among other changes (#115)
* improve performance by not using broadcasting for very small arrays
* further improve the performance of creating the graph
* add some comments and apply fixups from review