Energy and bandwidth are limited resources in wireless sensor networks, and communication consumes a significant amount of energy. When wireless vision sensors are used to capture and transfer image and video data, the problems of limited energy and bandwidth become even more pronounced. Thus, message traffic should be decreased to reduce the communication cost. In many applications, the goal is to detect composite, semantically higher-level events based on information from multiple sensors. Rather than sending all the information to the sinks and performing composite event detection at the sinks or a control center, it is much more efficient to push the detection of semantically high-level events into the network and perform composite event detection in a peer-to-peer and energy-efficient manner across embedded smart cameras. In this paper, three different operation scenarios are analyzed for a wireless vision sensor network. A detailed quantitative comparison of these operation scenarios is presented in terms of energy consumption and latency. This quantitative analysis provides the motivation for, and emphasizes, (1) the importance of performing high-level local processing and decision making at the embedded sensor level and (2) the need for peer-to-peer communication solutions for wireless multimedia sensor networks.