Beyond the Bar: Measuring Real-World Legal Judgment

By Michael Pusic, J.D. candidate, Harvard Law School

Image by Felicia Quan, J.D. candidate, Harvard Law School

You answer a lot of multiple-choice questions to become a lawyer. In the U.S., you take the SAT to get into college. You then take the LSAT (or GRE) to get into law school. Three years later, you take the bar exam to become a licensed attorney.

But after that, the standardized testing ends. There are many reasons for this, but at least one is the conventional wisdom that the day-to-day work of a lawyer is hard to evaluate via standardized multiple-choice questions. Educational research confirms the intuition that multiple-choice questions test factual knowledge more easily than real-world performance.

Novel research in medical education challenges this intuition, suggesting that carefully selected standardized multiple-choice questions can evaluate many of the decisions physicians make on a daily basis. This research has real stakes for the medical profession: what if we could evaluate radiologists not on their anatomical knowledge, but on their ability to determine whether an X-ray shows a fracture? What if we could evaluate dermatologists not just on their conceptual understanding of melanoma, but also on their ability to determine when a mole needs a biopsy? By presenting a full spectrum of cases, researchers might be able to model the kind of decision making that is directly relevant to practice.

This post explores whether this type of evaluation might also be possible for lawyers. The bar exam currently tests legal knowledge. What if it could test decision making? When should an undocumented immigrant make an affirmative case for asylum? When should a client facing eviction settle? Should a defendant take a plea deal? Tests that accurately measured responses to these questions could be used to train and evaluate young lawyers, or even to compare lawyers' abilities across firms.

Assessing Doctors 

Physicians make countless decisions on a daily basis. An obstetrician decides whether a cesarean section is necessary. A pathologist discerns whether a prostate biopsy is benign or malignant. An emergency physician determines from chest radiographs whether pneumonia is present.

Some cases are easy, and others involve close calls. Which are which? At scale, it can be hard to know with certainty how each case should be decided, much less where a given physician ranks among their peers. 

But hard to know is different from impossible. We might have 100 physicians evaluate one case to develop a consensus view (or at least a strong minority view) as to how it should be decided. Or we might have a single physician review 100 cases to understand their particular diagnostic ability. A recent research innovation was to present ~100 physicians with the same carefully chosen ~100 cases. This methodology leverages the wisdom of crowds to establish a consensus view on the proper outcome in each case, while also measuring each physician's decision-making ability relative to that of their peers.

In one study, researchers presented 157 dermatologists with 100 images of moles. The task was to rate each image on a scale of one to five based on how likely the lesion was to be cancerous and require a biopsy. The results provided a consensus diagnosis and confidence interval for each case.
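To make the arithmetic concrete, here is a minimal sketch of how such a consensus and confidence interval might be computed. The data below is randomly generated for demonstration (the study's actual ratings are not reproduced here), and the confidence interval uses a simple normal approximation:

```python
import numpy as np

# Hypothetical ratings matrix: 157 dermatologists x 100 cases,
# each entry a 1-5 suspicion score (5 = almost certainly biopsy).
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(157, 100))

# Consensus view per case: the mean rating across all raters.
consensus = ratings.mean(axis=0)

# 95% confidence interval per case (normal approximation).
sem = ratings.std(axis=0, ddof=1) / np.sqrt(ratings.shape[0])
ci_low, ci_high = consensus - 1.96 * sem, consensus + 1.96 * sem

for case in range(3):  # show the first few cases
    print(f"case {case}: consensus {consensus[case]:.2f} "
          f"(95% CI {ci_low[case]:.2f}-{ci_high[case]:.2f})")
```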

Physicians can, of course, be evaluated based on whether they accurately predicted the results in each case, but this approach adds nuance. For instance, say your dermatologist thought a mole was malignant when 156 of her colleagues thought it was not. If she was right, she might be the best in her field. If she was not, you might think about changing practices.

Either way, the peer comparison methodology provides more information than a simple check of whether a pictured mole actually developed into cancer. The problem with the latter comparison is that predictive accuracy alone does not reveal whether a case was genuinely hard or genuinely easy to diagnose.
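One way to capture that nuance, sketched below with synthetic data and hypothetical binary biopsy/no-biopsy calls, is to weight each correct call by how often peers got that case wrong, so that correct answers on hard cases count for more than correct answers everyone reached:

```python
import numpy as np

rng = np.random.default_rng(1)
n_raters, n_cases = 157, 100
# Binary calls: 1 = "biopsy", 0 = "no biopsy" (e.g., thresholded ratings).
calls = rng.integers(0, 2, size=(n_raters, n_cases))
# Known outcomes for each case (did the lesion turn out malignant?).
outcomes = rng.integers(0, 2, size=n_cases)

correct = calls == outcomes  # (raters x cases) boolean

# Case difficulty: the share of peers who got the case wrong.
difficulty = 1.0 - correct.mean(axis=0)

# Difficulty-weighted score: a correct call on a case most peers
# missed counts for more than one everyone made.
weighted = (correct * difficulty).sum(axis=1) / difficulty.sum()
raw_accuracy = correct.mean(axis=1)

best = int(weighted.argmax())
print(f"rater {best}: raw accuracy {raw_accuracy[best]:.2f}, "
      f"difficulty-weighted score {weighted[best]:.2f}")
```

Under this kind of scoring, two physicians with identical raw accuracy can rank quite differently depending on whether their correct calls came on contested cases or easy ones.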

It is not hard to see how this peer concept could be broadly applied. Rather than evaluating urological pathologists on their conceptual understanding of prostate cancer, researchers can directly measure their ability to diagnose prostate cancer. Rather than quizzing cardiologists on the symptoms of pericarditis, researchers can now evaluate the quality of their diagnoses. More accurate evaluations of physicians' abilities lead to better instruction, more reliable diagnoses and treatments, and ultimately better healthcare outcomes for patients.

Assessing Lawyers 

Could this approach be used to evaluate and train lawyers? As in medicine, the framework works best under specific conditions. First, the decision should ideally be binary (settle/litigate, file/do not file) to allow for clear statistical comparison across practitioners. Second, all information needed for the decision must be contained in materials that can be easily replicated and independently reviewed. Finally, the decision must occur frequently enough, and against sufficiently consistent criteria, that practitioners develop pattern-recognition skills that can be meaningfully evaluated.

Public interest law has many situations that meet these criteria. Consider a few examples: 

  1. Asylum Case Strategy: Immigration lawyers must decide whether to recommend filing for affirmative asylum—a binary decision with life-altering consequences. This decision relies entirely on reviewable documents, or at least on documents that provide enough information for a useful evaluation: the completed Form I-589, personal affidavit, country condition reports, and medical records. Nearly a million asylum applications are filed each year, and they are evaluated against relatively consistent statutory criteria under U.S. asylum law.
  2. Eviction Defense: Housing attorneys regularly decide whether to recommend contesting an eviction or pursuing a settlement/negotiated move-out—a yes/no decision. Such decisions can be based, at least approximately, on standardized materials including the eviction complaint, lease, payment ledger, termination notice, and state law. With 3.6 million eviction filings annually and relatively uniform legal frameworks (at least within states), these cases create ample opportunity for pattern recognition.
  3. Plea Bargain Advice: Criminal defense attorneys must recommend whether clients should accept plea offers or proceed to trial—a binary choice based, at least approximately, on reviewable case files containing police reports, charging documents, evidence summaries, written plea terms, and state law. These decisions occur with high frequency (over 95% of criminal cases resolve via plea).
  4. Domestic Violence Protective Order Filing: Family lawyers decide whether to file for protection orders based on a standardized set of materials: client statements, police reports, medical records, communication evidence, and state law. These cases arise frequently (over a million annually) and are evaluated against at least somewhat similar statutory criteria across jurisdictions.
  5. Social Security Disability Claims: Benefits attorneys must determine whether to pursue disability claims—a binary decision based on application forms, medical records, work history, and administrative documents. Over two million cases are brought annually under uniform federal standards.

This approach works particularly well in legal aid and public interest contexts because attorneys there handle high volumes of similar cases against relatively static legal frameworks. Lawyers' decisions in these practice areas follow recurring patterns that allow for meaningful comparison of decision-making quality across practitioners.

This methodology could transform legal training and evaluation. Within organizations, rather than the traditional model of case-by-case supervision, new lawyers could review standardized sets of cases, with their judgment assessed against expert consensus. Performance assessment could shift from subjective opinions to objective metrics of decision quality. It could also reveal when non-lawyers—such as accredited representatives, or even artificial intelligence—outperform lawyers at pattern recognition tasks, helping to clarify when legal expertise is essential and when it is not.  
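As a hypothetical sketch of what such an assessment might look like, the snippet below scores a trainee's binary settle/litigate recommendations against a synthetic expert panel, weighting agreement by how unified the panel was on each case (a split panel signals a close call, so disagreement there costs little). The panel size, case count, and data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_experts, n_cases = 50, 40
# Expert panel's binary recommendations: 1 = settle, 0 = litigate.
panel = rng.integers(0, 2, size=(n_experts, n_cases))
trainee = rng.integers(0, 2, size=n_cases)

# Consensus recommendation and its strength per case
# (strength: 0 = evenly split panel, 1 = unanimous panel).
share = panel.mean(axis=0)
consensus = (share >= 0.5).astype(int)
strength = np.abs(share - 0.5) * 2

# Agreement weighted by consensus strength: disagreeing with a
# unanimous panel costs more than disagreeing on a split case.
agree = (trainee == consensus).astype(float)
score = (agree * strength).sum() / strength.sum()
print(f"consensus-weighted agreement: {score:.2f}")
```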

Across organizations, attorney quality could be measured by actual decision-making skill rather than proxies like one's law school or clerkships, potentially reshaping hiring, promotion, and professional development in public interest law. This model would also be highly valuable to funders: imagine a foundation looking to allocate $1 million to a legal aid organization. To choose the recipient, it could commission a costly, multi-year randomized controlled trial. Alternatively, it could administer a standardized set of cases and evaluate the quality of each organization's lawyers against a broader pool—a faster, cheaper, and more directly relevant measure of legal service quality.
