We address the problem of sequentially selecting and observing processes from a given set to find the anomalies among them. The decision maker observes one process at a time and obtains a noisy binary indicator of whether or not the corresponding process is anomalous. In this setting, we develop an anomaly detection algorithm that chooses the process to be observed at a given time instant, decides when to stop taking observations, and makes a decision regarding the anomalous processes. The objective of the detection algorithm is to arrive at a decision with an accuracy exceeding a desired value while minimizing the delay in decision making. Our algorithm relies on a Markov decision process defined using the marginal probability of each process being normal or anomalous, conditioned on the observations. We implement the detection algorithm using the deep actor-critic reinforcement learning framework. Unlike prior work on this topic that has exponential complexity in the number of processes, our algorithm has computational and memory requirements that are both polynomial in the number of processes. We demonstrate the efficacy of our algorithm using numerical experiments by comparing it with the state-of-the-art methods.