[NPUW] Optimize pipeline initialization time in online partitioner (#34958)
### Details:
`getPartitioning` ran very slowly when the graph contained a large
number of nodes.
This PR optimizes several hotspots to improve performance at scale.
- `hasCycle()`: add O(1) fast-path for single-consumer producers; fix
DFS to mark visited at push-time to avoid redundant stack pushes
- `mergeUniques()`: replace per-group O(V) getRepGroups() scan with a
pre-built rep-tag->GPtrSet index, reducing total cost from O(V^2) to
O(V)
- `interconnect()`: use const ref for ports_map to avoid deep copy on
each call
- `metaInterconnect()`: cache MetaInterconnectIO; invalidate on layer
changes
- `getMetaDesc()`: add a cache held by the snapshot to avoid
re-stringifying the same node
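
The `mergeUniques()` change above can be illustrated with a minimal sketch. It assumes groups are matched by a string rep tag; `Group`, `scanRepGroups`, and `buildRepIndex` are hypothetical stand-ins and not the actual NPUW types or functions. The point is only the shape of the change: one O(V) pass builds the index, after which each lookup is O(1) on average instead of a full scan.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins; the real NPUW Group / GPtrSet definitions differ.
struct Group {
    int id;
    std::string rep_tag;  // tag shared by groups that repeat each other
};
using GPtr = std::shared_ptr<Group>;
using GPtrSet = std::unordered_set<GPtr>;

// Old shape: one O(V) scan per query, so O(V^2) when called for every group.
GPtrSet scanRepGroups(const std::vector<GPtr>& all, const GPtr& g) {
    GPtrSet out;
    for (const auto& other : all) {
        if (other->rep_tag == g->rep_tag) {
            out.insert(other);
        }
    }
    return out;
}

// New shape: a single O(V) pass builds a rep-tag -> GPtrSet index.
std::unordered_map<std::string, GPtrSet> buildRepIndex(const std::vector<GPtr>& all) {
    std::unordered_map<std::string, GPtrSet> index;
    for (const auto& g : all) {
        index[g->rep_tag].insert(g);
    }
    return index;
}
```

With the index built once up front, a per-group query becomes `index.at(g->rep_tag)` and returns the same set the scan would have produced.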
~30x improvement with a chunk size of 256:
- 16K context: 200s -> 7s
- 8K context: 58s -> 2.5s
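
For reference, the push-time marking mentioned for `hasCycle()` can be sketched as an iterative DFS. This assumes the cycle check reduces to a reachability search over an adjacency list; `Graph` and `reachable` are hypothetical names, not the partitioner's API, and the `pushes` counter exists only to make the effect visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adjacency-list graph; node ids are indices into `adj`.
using Graph = std::vector<std::vector<std::size_t>>;

// Returns true if `target` is reachable from `start`.
// `pushes` reports how many stack pushes the search performed.
bool reachable(const Graph& adj, std::size_t start, std::size_t target, std::size_t& pushes) {
    std::vector<bool> visited(adj.size(), false);
    std::vector<std::size_t> stack;
    visited[start] = true;  // mark when pushing, not when popping
    stack.push_back(start);
    pushes = 1;
    while (!stack.empty()) {
        const std::size_t node = stack.back();
        stack.pop_back();
        if (node == target) {
            return true;
        }
        for (std::size_t next : adj[node]) {
            if (!visited[next]) {
                visited[next] = true;  // a node can never be pushed twice
                stack.push_back(next);
                ++pushes;
            }
        }
    }
    return false;
}
```

If nodes were marked only when popped, a node with several already-expanded predecessors would be pushed once per predecessor. Marking at push time bounds the number of pushes by the number of nodes.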
### Tickets:
- *[EISW-209691](https://jira.devtools.intel.com/browse/EISW-209691)*
---------
Signed-off-by: intelgaoxiong <xiong.gao@intel.com>