The performance of a Visual Speech Recognition (VSR) system is highly influenced by the choice of visual features, which are categorized as static or dynamic. This work exploits both lip shape (static geometric features) and the temporal sequence of lip movements (dynamic motion features) to build a combined VSR system with fusion at both the feature level and the model level. The system is evaluated on a digit dataset against benchmark systems based on the Discrete Wavelet Transform (DWT), Discrete Cosine Transform (DCT), and Zernike Moments (ZM). First, the Motion History Image (MHI) is calculated for all visemes, and DWT, DCT, and Zernike coefficients are extracted from it and modeled using a simple GMM L-R HMM. These motion features improve performance significantly, with accuracies of 85% for MHI-DWT, 74% for MHI-DCT, and 80% for MHI-ZM. Geometric features are extracted using an Active Shape Model (ASM). Two types of fusion are used: feature-level fusion and model-level fusion. In feature-level fusion, the motion features (MHI-DWT, MHI-DCT, and MHI-ZM) are combined with the geometric features (ASM) and modeled using a GMM L-R HMM. The combined features improve accuracy to 96.5% for DWT-ASM, 84% for DCT-ASM, and 93% for ZM-ASM. Model-level fusion is performed using a two-stream HMM with stream weights for the DWT-ASM, DCT-ASM, and ZM-ASM features. Weighted model-level fusion yields a further improvement, with accuracies of 98.2% for DWT-ASM, 85% for DCT-ASM, and 94.5% for ZM-ASM. The proposed approach achieves higher recognition accuracy than the benchmark VSR systems (DWT, DCT, and ZM). © 2020 The Authors. Published by Elsevier B.V.
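
For illustration only, the MHI step can be sketched with the standard motion-history update rule: a pixel is set to a maximum value tau when motion is detected and otherwise decays by one per frame. This is a minimal sketch, not the authors' implementation. The function name `motion_history_image`, the frame-differencing motion detector, and the parameters `tau` and `diff_threshold` are assumptions introduced here; the abstract does not specify them.

```python
import numpy as np


def motion_history_image(frames, tau=None, diff_threshold=30):
    """Collapse a grayscale lip-region sequence into one motion history image.

    frames: array-like of shape (T, H, W) with T >= 2.
    tau: number of steps a motion trace persists (assumed default: T - 1).
    diff_threshold: absolute intensity change treated as motion (assumed value).
    Returns an (H, W) float32 array scaled to [0, 1]; recent motion is brighter.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected a (T, H, W) sequence with at least two frames")
    if tau is None:
        tau = frames.shape[0] - 1

    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Binary motion mask from simple frame differencing (an assumption).
        moving = np.abs(curr - prev) >= diff_threshold
        # Moving pixels are reset to tau; all others decay by one, floored at 0.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / float(tau)
```

The resulting single image summarizes where and how recently the lips moved, so a 2-D transform such as DWT, DCT, or Zernike moments could then be applied to it to obtain a fixed-length motion feature vector.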