Add nhwc support for NNAPI EP, add concat op, handle concurrent calls to NNAPI model (#4356)
* add support to internally transpose nchw input to nhwc and only transpose back if it is necessary
* more changes in nchw<->nhwc, fixed small issue in concat
* Add option for NNAPI to run on [all devices]/[cpu only]/[non-cpu only]
* minor code style changes
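
For reference, the nchw -> nhwc index mapping behind the internal transpose can be sketched as below. This is an illustration only, not the code in this change: TransposeNchwToNhwc is a hypothetical helper, and it assumes a dense float tensor stored in a flat vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the NNAPI EP): copies a dense NCHW float
// tensor into NHWC order. Dimensions are passed explicitly as an assumption
// for this sketch.
std::vector<float> TransposeNchwToNhwc(const std::vector<float>& src,
                                       std::size_t n, std::size_t c,
                                       std::size_t h, std::size_t w) {
  assert(src.size() == n * c * h * w);
  std::vector<float> dst(src.size());
  for (std::size_t in = 0; in < n; ++in) {
    for (std::size_t ic = 0; ic < c; ++ic) {
      for (std::size_t ih = 0; ih < h; ++ih) {
        for (std::size_t iw = 0; iw < w; ++iw) {
          // Flat offset in NCHW: channel varies slower than height and width.
          const std::size_t from = ((in * c + ic) * h + ih) * w + iw;
          // Flat offset in NHWC: channel is the fastest-varying dimension.
          const std::size_t to = ((in * h + ih) * w + iw) * c + ic;
          dst[to] = src[from];
        }
      }
    }
  }
  return dst;
}
```

Transposing back (nhwc -> nchw) swaps the roles of the two offsets, which is why it is worth skipping whenever the consumer of the output can take nhwc directly.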