In the fast-paced digital age, analyzing document layouts is crucial for automated information extraction and interpretation. Our research focuses on training the MViTv2 transformer architecture with Cascade Mask R-CNN on the BaDLAD dataset to extract text boxes, paragraphs, images, and tables from documents. After training on 20,365 document images for 36 epochs in a three-phase cycle, we achieved a training loss of 0.2125 and a mask loss of 0.19. Our work does not stop at training, however; we also explore several avenues for enhancing the model: the effects of rotation and flip augmentation, the benefits of slicing input images before inference, the implications of varying the resolution of the transformer backbone, and the potential of dual-pass inference to recover missed text boxes. Through these explorations, we found that some modifications yield noticeable performance improvements, while others provide valuable insights for future research.
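To make the slicing idea concrete, the sketch below shows one plausible form of sliced inference (not necessarily the exact pipeline used in this work): a tall document page is tiled into overlapping horizontal strips, a detector is run on each strip, and detections are shifted back into full-page coordinates. The `predict_fn` callable, tile height, and overlap are hypothetical stand-ins for whatever detector and settings are actually used.

```python
"""Minimal sketch of sliced inference for tall document pages.

Assumes `predict_fn` is any detector that takes an H x W x C image
and returns an (N, 4) array of xyxy boxes in tile coordinates.
"""
import numpy as np


def sliced_inference(image, predict_fn, tile_h=1024, overlap=128):
    """Run predict_fn on overlapping horizontal strips of `image`
    and return all boxes mapped back to full-page coordinates."""
    h = image.shape[0]
    step = tile_h - overlap
    boxes = []
    for y0 in range(0, max(h - overlap, 1), step):
        y1 = min(y0 + tile_h, h)
        tile_boxes = predict_fn(image[y0:y1])      # (N, 4) xyxy per tile
        if len(tile_boxes):
            shifted = np.asarray(tile_boxes, dtype=float)
            shifted[:, [1, 3]] += y0               # shift y back to page space
            boxes.append(shifted)
        if y1 == h:                                # last strip reached the bottom
            break
    return np.concatenate(boxes) if boxes else np.zeros((0, 4))


if __name__ == "__main__":
    # Dummy detector emitting one fixed box per strip, just to exercise the helper.
    dummy = lambda tile: np.array([[10.0, 5.0, 50.0, 40.0]])
    page = np.zeros((3000, 800, 3), dtype=np.uint8)
    print(sliced_inference(page, dummy).shape)     # -> (4, 4)
```

In practice, overlapping boxes from adjacent strips would also need deduplication (e.g. non-maximum suppression) before scoring.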