An Artificial Neural Network-based for Clinical Performance Assessment

Adrian Casillas,1, 2 Stephen Clyman,4 Brian Clauser,4 Yihua Fan,4 R. Stevens1, 3

1Department of Microbiology and Immunology, School of Medicine, University of California, Los Angeles; 2Department of Medicine, Division of Clinical Immunology and Allergy, School of Medicine, University of California, Los Angeles; 3Graduate School of Education and Information Science, CRESST, University of California, Los Angeles; 4National Board of Medical Examiners, Philadelphia; USA


We have used unsupervised artificial neural networks (ANNs) to explore alternative models of student performance and identify areas where such models may complement existing assessment models. One hundred random student performances were selected from a larger database of computer-based clinical scenario (CCS) performances on a case of bacterial meningitis. Classifications resulting from this neural network modeling were consistent with the National Board of Medical Examiners (NBME) model in that highly rated performances (ratings of 7 or 8) were clustered on the neural network output grid. Very low performance ratings shared few common features and were classified at isolated nodes. Several performance clusters with very disparate NBME ratings (ranging from 1 to 8) were electronically recreated as search path maps to visualize the strategies used at different nodes. The neural network clustering appeared to be sensitive to quantitative and qualitative test selections, some reflecting broader behavioral classification. These particular performance clusters did not appear to be coincidental since reproducibility across 3 separately trained networks could be achieved. In generating this performance model from a constructive data analysis approach, rather than through more traditional task analysis, we have validated the existing NBME CCS scoring model and provided further evidence for the utility of ANNs in educational training and assessment settings. These results also suggest that the performance model being created by the NBME scoring criteria is quite complete and that likely additions to this model would emphasize the more behavioral aspects of clinical performance.


The rapid evolution of information technologies is beginning to provide new opportunities for studying the complex behaviors of individuals engaged in complex tasks. Medical diagnosis and management is an example of a very complex task, and multiple data and performance models can be built from the ensuing patient/physician interaction.1, 2 These models differentially address such issues as cost, risk, and the quality of life.3 - 5 Currently, no tasks other than the clinical encounter itself adequately capture the information needed to simultaneously address all the aspects of these different models. Several tasks do exist, however, that are attempting to capture the critical features of each model, and these tasks form the basis of current and proposed medical licensing examinations.

The NBME has introduced the CCS examination, formerly known as CBX, in order to provide a simulated patient experience requiring examinees to continually monitor the patient and make appropriate management decisions.6 - 9 In assessing these tests, an expert rating system is employed where actions are matched with criteria and compared with broader performance standards. Although competence is based on a rating, in reality, the decision of competence is derived from multiple variables to ensure that the analytical scoring model being employed is optimized for the purposes of the examination.10 It must be recognized, however, that many aspects of learning, knowledge, and behavior affect performance, not all of which can be addressed in any particular examination. It is not always clear what problem-solving features may remain unrecognized in return for, perhaps, a level of efficiency when constructing or scoring a problem performance.

While the NBME’s CCS computer simulations have been refined for the purposes of licensure and certification, other more subtle models of student performance may exist within the data. This suggests the possibility for more exploratory and analytical techniques that can be readily applied to "discover" alternative classifications within complex data sets. Most performance assessments or intelligent tutoring systems begin with knowledge skills and cognitive task analysis and create suitable tasks and scoring criteria based on this analysis.10 The very broad nature of the CCS problem space allows alternative approaches toward developing performance models.

We have used a constructive data modeling approach that exploits the pattern recognition capabilities of ANNs to build performance models from existing complex data sets. ANNs are nonparametric techniques that build rich models of complex phenomena through a training and pattern recognition process and are capable of categorizing behavior based on actual performance sequences. Neural networks have had practical utility in solving classification problems with ill-defined categories, where the patterns are often deeply hidden within the data or where there are poorly defined models of behavior.11 In this study, we use neural network analysis to explore the completeness of the NBME CCS scoring model and to determine the validity of using ANN analysis to generate performance classifications from complex data sets in the absence of defined scoring criteria.


The NBME CCS Data Set
The NBME developed the CCS patient management simulations primarily for assessing clinical knowledge and competence of 3rd-year and 4th-year medical students.6, 8 In these simulations, examinees manage a patient in a realistic fashion by requesting various diagnostic tests, therapeutic options, and other clinically relevant items. The series of requested actions is recorded in simulated time, and the resulting transaction list defines the strategy for each particular student. These sequential actions served as the input to train our neural networks.12, 13

Unsupervised Neural Network Analysis of Performance Data
We utilized a self-organizing neural network map made up of a matrix of competitive nodes referred to as a Kohonen layer.14, 15 The network received data as a number of distinct inputs that are digital representations of the paths between the students’ test item selections. The number of inputs is based on the total number of unique test associations represented in the performances used for training.
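
The input encoding described above can be illustrated with a minimal sketch. It assumes a simple binary encoding (1 if a performance contains a given From-To pair, 0 otherwise); the study's exact digital representation is not specified here, and the item names and helper functions (`transitions`, `build_vocabulary`, `encode`) are hypothetical.

```python
def transitions(sequence):
    """Return the ordered (from_item, to_item) pairs in one performance."""
    return list(zip(sequence, sequence[1:]))


def build_vocabulary(performances):
    """Map every unique From-To pair in the training set to an input index."""
    unique_pairs = sorted({pair for seq in performances for pair in transitions(seq)})
    return {pair: index for index, pair in enumerate(unique_pairs)}


def encode(sequence, vocabulary):
    """Binary input vector: 1 where the performance used that From-To pair.

    Assumption: binary presence/absence; pairs unseen in training are ignored.
    """
    vector = [0] * len(vocabulary)
    for pair in transitions(sequence):
        if pair in vocabulary:
            vector[vocabulary[pair]] = 1
    return vector


# Hypothetical transaction lists (item names are illustrative only).
performances = [
    ["history", "physical", "cbc", "lumbar_puncture", "csf_culture"],
    ["history", "cbc", "lipid_profile", "urinalysis"],
    ["history", "physical", "lumbar_puncture", "csf_culture"],
]
vocabulary = build_vocabulary(performances)
inputs = [encode(seq, vocabulary) for seq in performances]
```

In this sketch the number of network inputs equals `len(vocabulary)`, which mirrors the statement that the input count follows from the unique test associations in the training performances.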

The neural network training process is iterative: the value of each output node is adjusted based on the magnitude and direction of the input vectors of each performance presented during training. Each pass of the entire training data set through the network constitutes an epoch (1 iteration). Each epoch results in adjustments of the magnitude and direction of the output vector until the completion of training. The duration of training is empirically derived, and in our experience, 1,000 to 10,000 epochs were sufficient for achieving consistent classification. Our networks were trained with a data set consisting of 100 performances from a single case.
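
For readers unfamiliar with Kohonen training, the following is a minimal sketch of one common formulation, not the Ward Systems implementation used in this study. The Gaussian neighborhood, the linear decay of learning rate and radius, the random initialization, and all function and parameter names (`train_som`, `best_matching_node`, `lr0`) are assumptions made for illustration.

```python
import math
import random


def best_matching_node(weights, x):
    """Index of the node whose weight vector is closest to input x."""
    def squared_distance(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda n: squared_distance(weights[n]))


def train_som(data, rows=10, cols=10, epochs=1000, lr0=0.5, seed=0):
    """Train a rows x cols Kohonen map; returns one weight vector per node."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):  # one epoch = one pass over all performances
        remaining = 1.0 - epoch / epochs
        lr = lr0 * remaining                     # assumed linear decay
        radius = max(radius0 * remaining, 0.5)   # assumed shrinking neighborhood
        for x in data:
            winner = best_matching_node(weights, x)
            wr, wc = divmod(winner, cols)
            for node, w in enumerate(weights):
                r, c = divmod(node, cols)
                grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                # Move the node toward the input, scaled by grid proximity.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy run on a 3 x 3 grid with four made-up binary input vectors.
toy_data = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
toy_weights = train_som(toy_data, rows=3, cols=3, epochs=50)
toy_winners = [best_matching_node(toy_weights, x) for x in toy_data]
```

Because the initial weights are random, separately trained maps can place the same cluster at different grid positions, which is why a fixed `seed` is exposed in this sketch.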

Following the training process, the same data used to train the network was presented to the network for classification. The Kohonen self-organizing neural network associates each input pattern with a representative output pattern. The winning nodes for the entire set of performance data are summed to produce the topographic representations seen throughout the results. The Kohonen self-organizing neural network was constructed with software libraries from Ward Systems Group (Rockville, MD).
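
The summation of winning nodes into a topographic map can be sketched as follows. This is an illustrative outline only: the nearest-weight-vector rule for choosing the winner, the row-major node numbering, and the tiny 2 x 2 map with hand-picked weights are assumptions, not details of the software used in this study.

```python
from collections import Counter


def winning_node(weights, x):
    """Index of the node whose weight vector is closest to input x."""
    def squared_distance(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda n: squared_distance(weights[n]))


def hit_map(weights, data, rows, cols):
    """Count how many performances win at each node, laid out as a grid."""
    counts = Counter(winning_node(weights, x) for x in data)
    return [[counts.get(r * cols + c, 0) for c in range(cols)] for r in range(rows)]


# Hand-picked weights for a 2 x 2 map (row-major node order) and toy inputs.
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
data = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.8], [0.0, 0.1]]
grid = hit_map(weights, data, rows=2, cols=2)
print(grid)  # [[1, 2], [1, 0]]
```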

Search Path Map Analysis
We electronically reconstructed the students’ problem-solving strategies using software that produces visual representations of students’ search through a defined problem space by querying the transaction database of the performances and displaying the results in a graphical form.16, 17 Search path maps are displayed as boxes that correspond to actions a student can select while managing a case. The tests can be grouped into a variety of formats to display the use of certain tests or concepts or to show details of the sequence of student selections in a particular test group. The different laboratory tests available were clustered into separate areas of the problem space, as shown in the Results. Individual student performances were overlaid on this template by a series of lines connecting the sequence of items chosen, with the lines going from the upper left-hand corner of a test selection to the lower center of the subsequent test. Thus, tracking a student’s strategy as a series of "From-To" pairs was possible. Where multiple student performances were displayed, the thickness of the lines between test item pairs was proportional to the number of students who made that test selection.13, 18
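
The line-thickness rule for multiple performances can be sketched in a few lines. This is not the search path mapping software itself; the function names, the item names, and the `max_width` scaling are hypothetical, and each From-To pair is counted once per student on the assumption that thickness reflects the number of students rather than the number of repetitions.

```python
from collections import Counter


def edge_counts(performances):
    """Number of students who made each From-To selection."""
    counts = Counter()
    for seq in performances:
        # set(...) counts a pair once per student even if it was repeated.
        counts.update(set(zip(seq, seq[1:])))
    return counts


def line_widths(counts, max_width=8.0):
    """Line width proportional to the number of students using each pair."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {edge: max_width * n / peak for edge, n in counts.items()}


# Hypothetical performances (item names are illustrative only).
performances = [
    ["history", "physical", "cbc", "lumbar_puncture", "csf_culture"],
    ["history", "cbc", "lipid_profile", "urinalysis"],
    ["history", "physical", "lumbar_puncture", "csf_culture"],
]
counts = edge_counts(performances)
widths = line_widths(counts)
```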


ANN Classification of Case Performance Data
In order to derive ANN performance classifications, a training set of 100 randomly selected performances of case 139 (meningitis) was used to train an unsupervised network. The training data was then analyzed by the same network to identify performance clusters (Figure 1). We next selectively queried the NBME database for specific performance ratings that could be identified at specific node outputs to isolate the nodes corresponding to high or low ratings. The values ranged from 1 (worst) to 8 (best). We found a contiguous 2-node cluster (nodes 43/44) composed of highly rated performances (rating ≥ 7) (Figure 1A). Our analysis of low ratings (less than 4) revealed very limited clustering (Figure 1B).
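
The step of matching ratings to node outputs can be illustrated with a small sketch. The performance identifiers, node numbers, and ratings below are made-up toy values, not the study's data, and the helper names and the `min_size` threshold are hypothetical.

```python
from collections import defaultdict


def ratings_by_node(assignments, ratings):
    """Group expert ratings by the node at which each performance was classified."""
    grouped = defaultdict(list)
    for perf_id, node in assignments.items():
        grouped[node].append(ratings[perf_id])
    return dict(grouped)


def nodes_in_range(grouped, low, high, min_size=2):
    """Nodes holding at least min_size performances, all rated within [low, high]."""
    return sorted(
        node for node, rs in grouped.items()
        if len(rs) >= min_size and all(low <= r <= high for r in rs)
    )


# Toy values for illustration only (not taken from the NBME database).
assignments = {"p1": 5, "p2": 5, "p3": 6, "p4": 6, "p5": 30, "p6": 17, "p7": 17}
ratings = {"p1": 8, "p2": 7, "p3": 7, "p4": 8, "p5": 1, "p6": 2, "p7": 8}
grouped = ratings_by_node(assignments, ratings)
high_nodes = nodes_in_range(grouped, low=7, high=8)
print(high_nodes)  # [5, 6]
```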

Figure 1: Output nodes for performances of problem 139 segregated by (A) rating greater than 4 and (B) rating less than 4 (failing). The major nodes are shown with arrows pointing to their location on the grid, and the number of performances at each node is designated. Search path maps selected by rating for problem 139. Highly rated performances (rating = 7 or 8) are shown (C) beside those failing performances (rating below 3) (D). Note the lack of "CSF" domain usage in (D).

To understand the actual strategies accounting for the classifications, the performance strategy was overlaid on the template of the functional problem. In this display, each number-coded box represented a specific test, and the tests (for example, History and Physical items, Blood Cell Tests, and Cerebrospinal Fluid [CSF] tests) were grouped on the template into their respective categories. By drawing connecting lines between the requested test items, we were able to reconstruct the examinee strategies as search path maps (Figure 1C, D).

Our initial observation was that the degree of thoroughness within the area of CSF tests for this meningitis case was associated with the overall rating. Performance ratings of 7 and 8 (i.e., at nodes 43 and 44) were uniformly associated with the selection of many CSF-associated items (Figure 1C), while ratings below 3 (i.e., node 89) were associated with minimal or no test ordering in the CSF domain (Figure 1D). As expected, the lowest-rated performances showed no usage in the CSF test domain, suggesting a failure to even recognize the problem as one of meningitis.

Clustering of Performances Across Networks
There is often significant variability in the performance of a series of neural networks trained with the same data set due to initial training conditions and the properties of the data itself. We were interested to know how well the network-assigned classifications were retained when additional neural networks were trained with the same data. Two additional neural networks were trained with the same architecture (i.e., 10,000 epochs with a 10 x 10 output). We selected major nodes from the ANN output of our original network (Network #1) in order to compare these performance clusters as they were classified on 2 independently trained neural networks (#2 and #3). We expected that the data would reflect the fact that clusters in 1 network would be represented in a topologically ordered manner, meaning that 1 cluster may be represented by physically close (adjacent) nodes between different neural networks if the data represent similar patterns.14, 15 Five major output nodes generated by Network #1 (nodes 10, 21, 25, 29, 43/44, and 89) were compared to the corresponding outputs generated for the 2nd and 3rd networks (Table 1).

Table 1: Comparison of performance clustering between specific nodes of Network #1 with 2 independently trained networks (#2 and #3)


In node 10 of Network #1, 10 of 11 performances (91%) clustered on Network #2 while 9 of 11 (82%) similarly clustered on Network #3. Between Networks #2 and #3, there was 73% preservation of clustering. Node 21 of Network #1 showed only 50% cluster correlation with either Network #2 or #3. However, the latter 2 networks were identical in terms of their classification for the performances, since the same performances were clustered within each of 2 nodes. In node 25, there was a 63% correlation of Network #1 with Networks #2 and #3, but the latter 2 neural networks, again, were identical in terms of the classifications generated with the respective performances. Node 43/44, where the highest NBME ratings were found on Network #1, showed that 8 of 12 performances (67%) clustered on Network #2; however, between Networks #1 and #3, 10 of the 12 performances (83%) were similarly classified at the adjacent nodes 25 and 45 of Network #3. Furthermore, between Networks #2 and #3 there was 83% preservation of clustering. As expected, the node representative of the poorest NBME ratings, node 89, correlated 100% with the classifications generated by Networks #2 and #3. Node 29 was unusual in that the best correlation with either of the 2 independent networks was 56%. These clusters were not retained between Networks #2 and #3, since only 33% of the performances clustered together on these 2 networks.
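
One way such preservation percentages could be computed is sketched below. This is an assumed definition, not necessarily the procedure used for Table 1: it takes the members of one Network #1 node and reports the largest fraction that land on a single node, or on nodes adjacent to it, in another network. Node indices are zero-based and row-major, and the function names and toy assignments are hypothetical.

```python
def neighborhood(node, rows, cols):
    """The node itself plus its (up to 8) adjacent nodes on the grid."""
    r, c = divmod(node, cols)
    return {
        rr * cols + cc
        for rr in range(max(r - 1, 0), min(r + 2, rows))
        for cc in range(max(c - 1, 0), min(c + 2, cols))
    }


def preservation(members, assignment_b, rows=10, cols=10):
    """Largest fraction of members that stay together on the other network."""
    nodes_b = [assignment_b[m] for m in members]
    best = max(
        sum(1 for n in nodes_b if n in neighborhood(center, rows, cols))
        for center in set(nodes_b)
    )
    return best / len(members)


# Toy example: 11 performances from one node; on the second network,
# 10 land on two adjacent nodes (57 and 58) and 1 lands elsewhere (node 3).
members = [f"p{i}" for i in range(1, 12)]
assignment_b = {m: 57 for m in members[:6]}
assignment_b.update({m: 58 for m in members[6:10]})
assignment_b[members[10]] = 3
rate = preservation(members, assignment_b)
print(round(100 * rate))  # 91
```

The toy counts were chosen to reproduce the arithmetic of the first comparison above, 10 of 11 performances, which rounds to 91%.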

Figure 2: Search path maps of node 29 performances. Node 29 was segregated into poorly rated (A) and highly rated (B) performances. The use of superfluous tests is shown in the captions with arrows indicating each group of tests. Total neural network outputs of all nodes for 2 performances with a primary peak at node 29. A performance with a low rating of 2 (C) is shown in contrast to a performance with a rating of 8 (D). Note the lack of secondary nodal output in (C) compared to the significant secondary peak at node 43 in (D).

Inconsistent Aspects of the NBME and ANN Performance Assessment Models
There were other nodes that could not be explained as classifications extending directly from the NBME expert-rater’s model. One example is the group of performances clustered at node 29, which failed to cluster consistently across networks.

The search path map analysis for performances at this node (Figure 2A, B) indicated the use of excessive test item selections throughout several domains. Regardless of the search strategy employed within the CSF domain, the consistent feature of these performances was the overuse of a number of tests including serum lipid profiles, urine tests, culture and sensitivity, and additional blood chemistries whether the rating was low (Figure 2A) or high (Figure 2B).

The fact that there was a cluster of performances at node 29 with disparate ratings also prompted us to investigate the possibility that there may be additional information within the network output useful for characterizing the quality of these performances. Up to this stage, the network output we had focused on was the highest nodal output rather than that of the entire output space (i.e., all 100 nodes on the 10 x 10 grid). In the next series of experiments, instead of viewing the outcome of a performance only by its ANN-assigned winning node, all of the outputs generated at the final step of the performance were visualized. We observed that the final step outputs for poorly rated performances (Figure 2C) show only a single area of significant output (at node 29) with a poorly defined secondary output area. In contrast, a significant secondary peak at node 43 (Figure 2D) characterized the excellent performances at node 29. Through the visualization of the entire final output, it is apparent that the winning output at node 29 grouped the 2 performances on the basis of their excessive test usage, as was shown above. The development of a significant secondary peak in a defined area associated with high ratings (i.e., node 43/44) served to differentiate the superior strategies from the weaker strategies at the same node.
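
Looking beyond the winning node can be sketched as a simple peak search over the full output grid. The sketch assumes one activation value per node in row-major order and treats a secondary peak as the strongest output outside the immediate neighborhood of the winner; the function name, the `exclude_radius` parameter, and the 3 x 3 toy activations are hypothetical and not taken from the study.

```python
def primary_and_secondary(acts, cols, exclude_radius=1):
    """Return (winning node, strongest node outside the winner's neighborhood)."""
    primary = max(range(len(acts)), key=acts.__getitem__)
    pr, pc = divmod(primary, cols)
    candidates = [
        n for n in range(len(acts))
        if max(abs(n // cols - pr), abs(n % cols - pc)) > exclude_radius
    ]
    secondary = max(candidates, key=acts.__getitem__) if candidates else None
    return primary, secondary


# Toy 3 x 3 output grid (row-major): a winner at node 0 and a second,
# spatially separate area of strong output at node 8.
acts = [0.9, 0.5, 0.1,
        0.4, 0.3, 0.2,
        0.1, 0.2, 0.7]
primary, secondary = primary_and_secondary(acts, cols=3)
ratio = acts[secondary] / acts[primary]  # relative strength of the secondary peak
print(primary, secondary)  # 0 8
```

A ratio close to 1 would indicate a performance that nearly classified at the secondary location, which is the kind of distinction drawn between Figure 2C and Figure 2D.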


The idea of mental models can be useful when discussing procedures that occur through the acquisition and use of knowledge, and it is this process that we have addressed in our study. These models are represented by the recurrent performance patterns or strategies that relate external events to what is already known. These models are dynamic, continually being updated and modified with experience in conjunction with the nature of the problem being encountered at a particular time. Our approach to generating a neural network model of patient management was deliberately retrospective. The data had been previously collected, but the nature of the case and the coding of the actions were unknown during the training and preliminary classification of student performance. In essence, we worked from the data to the model, a different approach from traditional scoring and task analysis.

We were able to find consistencies between our neural network model and the expert-rater model established by the NBME. The best agreement between the 2 models was observed among those performances with high ratings at node 43/44 (Figure 1A). It was evident that among the training performances at this node cluster, a high rating by expert raters was predictable when the output was classified to node 43/44. There was no single node that predicted a low rating. In fact, the lack of performance clustering was often predictive of a poor performance (Figure 1B). This indicates that incomplete or poorly formed strategies are more likely to reflect elements of inefficiency or error. It is also consistent between the CCS expert and neural network models, since a poor rating reflects the lack of a cohesive and productive strategy. In fact, the uncued manner of testing employed in the CCS examinations is more likely to lead to more random approaches when a mental model is inadequate or incomplete.

Further validation of our claim for the consistency of the NBME model and the neural network model was shown through the training of multiple networks where we noted retention of performance clustering in 2 additional networks (Table 1). These had been trained under the same conditions as the 1st network, and in 4 of the 5 major nodal outputs analyzed, more than 80% of the performances at 1 node were classified together by a different network. This finding could be verified with at least 2 of the 3 networks analyzed.

We also observed areas of inconsistency between the 2 models represented by the NBME and the neural network. This was most apparent at node 29 where expert-assigned ratings ranging from 2 (poor) to 8 (excellent) were classified together. When the search path maps for these performances were generated, we observed that the network classification was based on an excessive number of tests used (Figure 2). This classification by the neural network brought features of performances together where strategies reflected exhaustive, unproductive searches (Figure 2A) as well as those that show the development of a productive approach within, perhaps, a less focused strategy (Figure 2B).

We felt that a transition in progress was emerging at this node, and we were able to observe a meaningful subclassification within the strategies at node 29. These transitions were seen when the neural network output for each node in the 10 x 10 output grid was visualized (Figure 2). In this case, the development of a significant secondary output peak at node 43/44 indicated that these particular performances contained similarities to excellent performances (as in Figure 2B), but due to an overwhelming number of tests ordered, the final output was at node 29. This is an important realization when considering the true dynamic nature of learning and comprehension.

This example also serves to illustrate the nature of unsupervised neural networks in classifying patterns that may be hidden within data.11 ANNs can aid the assessment process by capturing the full repertoire of performances in a population even within a specific case in order to provide information about problem-solving behavior. An ANN model, as a resource in the assessment of complex, real-world clinical problem solving, is compatible and complementary with currently employed models for assessment and recognizes a behavioral dimension that has been difficult to objectify. It is also likely that the model studied here will be useful with other complex performance data sets where behaviors are likely to be hidden within the data.


  1. Elstein AS, Kleinmuntz B, Rabinowitz M, McAuley R, Murakami J, Heckerling PS, et al. Diagnostic reasoning of high- and low-domain-knowledge clinicians: a reanalysis. [Published erratum appears in Med Decis Making. Jul.-Sep. 1993;13(3):267.] Med Decis Making. 1993;13:21-29.

  2. Groen GJ, Patel VL. The relationship between comprehension and reasoning in medical expertise. In: Chi MTH, Glaser R, Farr MJ, eds. The Nature of Expertise. Hillsdale: L. Erlbaum Associates; 1988:287-310.

  3. Elstein AS. Beyond multiple-choice questions and essays: the need for a new way to assess clinical competence. Acad Med. 1993;68:244-249.

  4. Fehrsen GS, Henbest RJ. In search of excellence—expanding the patient-centred clinical method—a 3-stage assessment. Fam Pract. 1993;10:49-54.

  5. Lantz MS, Chaves JF. What should biomedical sciences education in dental schools achieve? J Dent Educ. 1997;61:426-433.

  6. Clauser BE, Subhiyah RG, Nungester RJ, Ripkey DR, Clyman SG, McKinley D. Scoring a performance-based assessment by modeling the judgements of experts. J Educ Meas. 1995;32(4):397-415.

  7. Clauser BE, Subhiyah RG, Piemme TE, Greenberg L, Clyman SG, Ripkey D, et al. Using clinician ratings to model score weights for a computer-based clinical-simulation examination. Acad Med. 1993;68:S64-S66.

  8. Clauser BE, Swanson DB, Clyman SG. The generalizability of scores from a performance assessment of physicians' patient management skills. Acad Med. 1996;71:S109-S111.

  9. Dunbar SB, Koretz DM, Hoover HD. Quality control in the development and use of performance assessments. Appl Meas in Educ. 1991;4:289-303.

  10. Mislevy RJ, Gitomer DH. The role of probability-based inference in an intelligent tutoring system. In: User Modeling and User-Adapted Interaction. Netherlands: Kluwer Academic Publishers; 1996:253-282.

  11. Rumelhart DE, McClelland JL. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge: MIT Press; 1986.

  12. Stevens RH, Lopo AC. Artificial neural network comparison of expert and novice. Proc Annu Symp Comput Appl Med Care. 1994;64-68.

  13. Stevens RH, Lopo AC, Wang P. Artificial neural networks can distinguish novice and expert strategies during complex problem solving. J Am Med Inform Assoc. 1996;3:131-138.

  14. Kohonen T. Self Organization and Associative Memory. Berlin: Springer; 1989.

  15. Lawrence J. Introduction to Neural Networks. Nevada City, CA: California Scientific Software Press; 1993.

  16. Stevens RH. Search path mapping: a versatile approach for visualizing problem-solving behavior. Acad Med. 1991;66(suppl 9):S72-S75.

  17. Stevens RH, Kwak AR, McCoy JM. Evaluating preclinical medical students by using computer-based problem-solving examinations. Acad Med. 1989;64:685-687.

  18. Stevens RH, McCoy JM, Kwak AR. Solving the problem of how medical students solve problems. MD Computing. 1991;8(1):13-20.