Mass Digitization of Chinese Court Decisions:
How to Use Text as Data in the Field of Chinese Law
Source: 吉大司法数据 Translator: 王怡聪 (Wang Yicong)
Over the past five years, Chinese courts have placed tens of millions of court judgments online. We analyze the promise and pitfalls of using this remarkable new data source through the construction and examination of a dataset of 1,058,990 documents from Henan province. Courts posted judgments in roughly half of all cases in 2014 and, although the percentage of cases posted online has likely risen since then, the single greatest challenge facing researchers remains documenting gaps in the data. We find that missing data varies widely by court, and that intermediate courts disclose significantly more documents than basic level courts. But court level, GDP per capita, population, and mediation rates are insufficient to fully explain variation in disclosure rates. Further work is needed to better understand how resources and incentives might be skewing the data. Despite incomplete information, however, a topic model of 20,321 administrative court judgments demonstrates how mass digitization of court decisions opens a new window into the practice of everyday law in China. Unsupervised machine learning combined with close reading of selected cases reveals surprising trends in administrative disputes as well as important research questions. Taken together, our findings suggest a need for humility and methodological pluralism among scholars seeking to use large-scale data from Chinese courts. The vast amount of incomplete data now available may frustrate attempts to find quick answers to existing questions, but the data excel at opening new pathways for research and at adding nuance to existing assumptions about the role of courts in Chinese society.
Keywords: Data; Law; Chinese Courts; Court Cases; Text as Data; Court Judgements
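Abridged translation
Over the past five years, Chinese courts have made tens of millions of court decisions public online. By building and examining a dataset of 1,058,990 documents from Henan province, we analyze both the promise of this remarkable new data source and the pitfalls of using it. In 2014, courts posted judgments in only about half of all cases. Although the share of cases posted online has likely risen since then, the greatest challenge facing researchers remains documenting the gaps in the data. We find that the amount of missing data varies widely from court to court, and that intermediate courts disclose far more documents than basic-level courts. However, court level, GDP per capita, population, and mediation rates are not enough to explain the differences in disclosure rates. Further work is needed to better understand how resources and incentives may be skewing the data. Despite the incomplete information, a topic model of 20,321 administrative court judgments shows how the mass digitization of court decisions opens a new window onto the everyday practice of law in China. Unsupervised machine learning, combined with close reading of selected cases, reveals surprising trends in administrative disputes as well as some important research questions. Taken together, our findings suggest that scholars seeking to use large-scale data from Chinese courts need humility and methodological pluralism. The large volume of incomplete data now available may frustrate attempts to find quick answers to existing questions, but the data excel at opening new avenues for research and at adding nuance to existing assumptions about the role of courts in Chinese society.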