
The Future of Document Data: Could Law Technology Lead the Way for Once?


by Nathan Morris

on 23 January, 2015

A while ago we wrote why legal professionals should take a long, hard look at Dropbox. Though perhaps the most dire problem facing attorneys who use the service is lack of security, we allowed ourselves to explore the system’s other limitations, and to imagine what better options could look like for the legal world.

We want to delve a little deeper into the issue of keeping files organized — because although the collaborative nature of Dropbox adds its own challenges to this issue, the problems of document organization aren’t unique to that system.

The Ghosts of Filing Cabinets Past: How We Structure Document Storage.

Though we’re in the thick of the Information Revolution, we still carry the marks of our past. For all our forward thinking, older methods of organizing information remain with us like a tailbone or wisdom teeth, to remind us where we came from.

All of our document systems have as their great-grandparent the venerable old filing cabinet — and the family resemblance still shows. Because the filing cabinet served us faithfully for centuries, in the 1970s when UNIX-based systems popularized storing virtual documents in a structured way, they didn’t deviate too far from document storage’s grey, metal predecessor.

The familiarity of this older approach undoubtedly made us more willing to take it up. And so even today, we find our digital documents, notes, and correspondence by looking inside the digital equivalent of the files and records room (often with little pixelated manilla folders to help us orient ourselves). As in the filing cabinet, to find what we need, we search for the general folder, then the subfolder, and on down to the title of the document itself, in a hierarchical classification system.

Even web pages are files in a folder on a computer somewhere out there. When you type in www.sitename.com, you’re just connecting to a server which can send you a file out of one of its folders set aside for visitors. Domain names are just a trick to help us find what is otherwise an eminently forgettable numerical address on a computer network. This means we can type in www.google.com rather than needing to remember a line of numbers like http://74.125.224.72/.

But the way we find those files on the internet now points to where the filing-cabinet analogy has been supplemented by something more useful. The number of web pages on the internet is around 60 trillion and growing. If we relied on older methods of information organization to find the needle we need in this haystack, we would be hopelessly lost. We would have to remember or have a personal directory of all the folders and subfolders we want to use, an absurdity that calls to mind Jorge Luis Borges’s short story about a map of the empire so detailed it eventually became the size of the empire.

So we google it instead.  Think of how revolutionary that is.

Google’s ridiculously large index of the internet (over 100 million gigabytes) isn’t a static map or directory, but rather is supported by mind-bendingly complex algorithms which allow us to find what we want based on our chosen keywords. They do it in a fraction of a second, 100 billion times a month. This is how they got the big bucks, and earned the honor of having their name turn into a verb.


It’s time that document storage on our own devices had an analogous revolution. We don’t have 60 trillion documents on our laptop, but we typically have more than we can handle. We don’t need the superalgorithms of Google, but we can’t pretend anymore that we’re doing fine using the old hierarchical classification system of folders, subfolders, and files. As storage capacity has grown, as our lives become increasingly paperless, and as our devices increasingly store other media which used to be held in spaces like photo albums and record collections, there are signs we’re reaching our cognitive limits at organizing and retrieving what we need. The result isn’t pretty.

Researchers at MIT suggest that our systems are needlessly reiterating the limitations of their physical counterpart, the filing cabinet. “One such problem,” they write, “is the inability to conveniently file documents in more than one category.” They point to the “dual role” of the filing cabinet, which our current document management systems reproduce: we expect the cabinet to both store paperwork and also organize it. “Moving towards the computer version,” they state, “we find that this duality makes little sense.”

But there are reasons the file-folder system has persisted into the digital age. We’ll explore what aspects of the hierarchical folder-filing system are still useful to us, investigate what makes it break down, and envision what a supplementary, tagging-based system could do for legal professionals.

How the File/Folder Hierarchy Works:

When humans have to find something, we do best when it’s grouped with like items, in a group that is itself near like groups. If we’re in a particular department store for the first time and we’re looking for a women’s size-9 loafer, we’re pretty confident we can first find the footwear section, and it will be divided into men’s, women’s, and children’s styles, each of which will be subdivided into types and sizes.

This top-down organization system creates a taxonomy, with one correct spot for each item. This is the structure of our biological classification systems — we file the species ‘house cat’ away in the Feline folder, which is in the Carnivore folder, in the Mammal folder, in the Vertebrate folder, in the Animal folder, in the mega-folder of Life itself. Putting ‘cat’ anywhere else would be incorrect, like putting loafers in the dairy aisle.


This also means that if scientists make the discovery that house cats are in fact a kind of rodent, then the entire thing would need to be reorganized and the textbooks rewritten. Hierarchical categorization, which is slow to change and rooted in the culture that created it, works really well for situations where such revolutions are unlikely, and where the “correct” spot is indicative of a deeper meaning —  like how we expect biological taxonomies to show evolutionary relationships in the tree of life.

Similarly, when we organize our files in a hierarchy, we’re creating our own tree. Each file is a leaf, connected to only one twig (its folder), which is in turn connected to only one branch, on down to the trunk of the entire storage system.
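The leaf-twig-branch picture can be made concrete in a few lines of Python. This is only an illustrative sketch: the file path is a hypothetical example invented here, and `folder_chain` is a helper name chosen for this illustration, not part of any real system.

```python
from pathlib import PurePosixPath

# Hypothetical file location, invented for illustration only.
photo = PurePosixPath("/photos/horses/rider_on_horse.jpg")


def folder_chain(path):
    """Return the folders from the file's own folder back to the root."""
    return [str(folder) for folder in path.parents]


# A file in a hierarchy has exactly one such chain: one twig, one branch, one trunk.
chain = folder_chain(photo)
print(chain)
```

Whatever the file contains, the hierarchy gives it a single address. Filing the same photo under a second folder would mean making a copy with a second, unrelated path.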

How it Fails:

Chris Harrison, Assistant Professor of Human-Computer Interaction at Carnegie Mellon University and director of the Future Interfaces Group, investigated what we need to do in order to keep a hierarchical organization system working. He notes that users have to name things clearly — which becomes especially important (and difficult) when we’re trying to keep track of different versions of the same document. But most importantly:

“users must be diligent and spend time to organize documents appropriately. If documents are not organized, or worse, incorrectly arranged, the system can become more unwieldy than a flat file system. Also, as directories become saturated with files, users create sub-folders to partition documents into smaller and more manageable sets. As users create and acquire additional files, maintaining and navigating these increasingly deep organizational structures becomes complicated and time-consuming.”

Though biologists might find it worth their time to find the correct spot for each species, we’re not so motivated when it comes to our files. Research by Jody Foo in human cognition and computer science reports that “many people find the task of filing documents ‘tidily’ takes more time than it saves.”

File this under L for Lame. Do you file this under H for “horse” or R for “rider?” How do you find it in the future?

One reason for this may be that we recognize that for many files there isn’t a clear, correct choice. Our documents aren’t part of some organic tree of life: sometimes they are more like monstrous hybrids. This may be one reason we dump so many of them on our computer’s “desktop,” which is simply a particularly visible folder. To understand what’s going on when we do this, computer scientists at MIT (Quan, Bakshi, Huynh, and Karger) looked at older research on messy desktops, back when they were actual desktops. Then as now, one major reason people were driven to create stacks of uncategorized papers is that they couldn’t choose between several potentially overlapping categories to file them under.

Foo gives as an example of this quandary a photographer who regularly takes pictures of horses and of people, with a folder for each type. When they have a photo of a person on a horse, they have to make a choice without a correct answer. The MIT team notes that when we place a file in a folder, we’re trying to prophesy what categories will be relevant to us in the future (in our example: will we think the horse or the human is more relevant at an unknown future date when we’ll want this again?). At that time, we’ll have to “remember the ordered sequence of topics and subtopics that were used to organize the information when attempting to retrieve it, even though the topics of interest during retrieval might be different from those during organization.” Maybe the person on the horse later became famous, but we can’t find a picture of her now because the horse interested us more at the time.

When our file taxonomy breaks down, either because we’ve categorized things in an order that is no longer relevant to us, or because we tried to avoid categorizing them to begin with, we try to do a search. That’s when our clear naming of files becomes particularly important. This is such a common occurrence that some of us create painfully long file names to try to include all the terms we expect we’ll search for in the future. If we haven’t named it well, our choices are to dig through the files or give up, particularly if it’s an image with one of those horrible automatically-generated names (I just opened my picture file and found one was melodiously titled “734230_583725334974780_1193488710_n.jpg”).

This is why the team at MIT warns: “the success of Internet search engines such as Google may suggest to some that search alone may solve most retrieval problems. We argue that this is not the case. Searching is of little use when the precise details of the target documents are not easily recalled.” Since our taxonomies are tied to our unique, deeply subjective way of organizing our worlds of information, when they fail, computers can’t always help us.

Tagging the Chaos:

But one thing computers are fantastic at is searching for key words or numbers, at speeds impossible for humans.

This is one reason for the success of companies like Amazon, which outright dumped the taxonomic system in their warehouses. Their women’s loafers aren’t in the shoe section, because there is no shoe section. There isn’t even a general ‘clothing’ section. Once they pop a unique barcode on each item, they can shelve it randomly (the system’s actually called ‘chaotic storage’) and rely on the computers to tell them how to find it later.

The International Business Times notes: “For a company like Amazon, which has thousands if not millions of different items to keep track of, it would take a significant amount of time just to stock the goods in an organized fashion, and that’s before an order even comes in. The staff can cover more ground using Amazon’s system, and they don’t need to spend time sorting items by product or shipping volume.” Again, though we don’t have as many documents as Amazon has knit sweaters, we also come up against the limits of proper storage. As a warehouse industry blog reminds us: “The term ‘chaotic storage’ is by the way only justified from a human point of view, but is not at all correct from the standpoint of a computer. For a warehouse management software, a chaotic storage system is nothing more than a sequence of calculations and database operations.”

No one’s going to try to convince people — especially attorneys — to begin storing their precious files “chaotically.” But we could adopt the “barcode” idea to better take advantage of a computer’s searching strength. This is essentially what happens when we ‘tag’ an item — we put in keywords that the computer can almost instantly retrieve when we need them, no matter where we’ve squirreled it away.

The main strength of tagging is that it’s a bottom-up, open-ended structure that has an infinite variety of correct classifications, instead of only one. Each category is inclusive, instead of the mutually exclusive categories in taxonomies (these cats could be tagged both “carnivore” and “rodent,” if we felt like it). The system is open to rapid change, because each change can be made independently, without altering the structure of the others. If we want to continue with a biological metaphor: instead of a tree, this is structured like a mass of interconnected neurons.

This new mindset in organizing and navigating through our documents, according to Quan and others at MIT, “allows information to be placed in multiple thematic ‘bins’, or categories, simultaneously. Allowing multiple categories lets the user organize documents in a more intuitive, richer information space and supports our belief that information inherently has multiple, relevant categories that the user can readily (albeit subjectively) identify.”


While the filing cabinet both stores and organizes material, Jody Foo notes that having a variety of categorizations effectively “decouples document organization and document storage. The location of a document is not the same as in which grouping(s) it is found. In this way, a document can belong to several organizational groupings (which can be hierarchical) without needing to be duplicated.”
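One way to picture this decoupling is a small index that maps each tag to the documents carrying it, while every document keeps its single storage location. The sketch below is a minimal illustration under that assumption; the `TagIndex` class, its method names, and the file paths are all hypothetical and are not drawn from the research quoted above.

```python
from collections import defaultdict


class TagIndex:
    """Maps each tag to the set of document paths that carry it."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach any number of tags to a document without moving or copying it."""
        for name in tags:
            self._docs_by_tag[name.lower()].add(path)

    def find(self, tag):
        """Return every document path carrying the tag, in sorted order."""
        return sorted(self._docs_by_tag.get(tag.lower(), set()))


index = TagIndex()
# The photo of a person on a horse lives in one place but belongs to two groupings.
index.tag("/photos/2015/img_001.jpg", "horse", "person")
index.tag("/photos/2015/img_002.jpg", "horse")

horse_photos = index.find("horse")
person_photos = index.find("person")
print(horse_photos)
print(person_photos)
```

The photographer no longer has to choose between the horse folder and the people folder: the same stored file turns up under either tag.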

And although the use of tags relies on the strengths of the computer, there’s something highly intuitive about it as well. The MIT researchers note: “Compared to the folder paradigm, multiple categorization not only improves organization and retrieval times but also matches more closely with the way users naturally think about organizing their information.”

They also provide an important caution: in order for a tagging system to work, it needs to be pervasive in the system. This is perhaps the biggest limitation to the tagging system today: we’re so accustomed to the (digital) manilla folders and filing cabinet that it would require a paradigm shift to consciously label a document with its relevant keywords. Failure to do this with even one document would begin to introduce inefficiencies and stress in the system.

The only solution going forward, then, is to begin implementing this tagging mentality in discrete systems such as those we use at our individual offices. For example, in your digital case files, a tagging and search system could revolutionize your at-work productivity. Imagine if each document was tagged not only with the names of its parent folders, but also with concepts like the name of the client, the recipient of the document, the subject of the document, and even the importance of the document, or how you feel about it. A simple search for the recipient name would bring up every document ever sent to that person. A search for “wow” would bring up all your best work for future reference.

In essence, you could store a single document in every folder you can imagine needing, instead of just in a client file, or worse, a catch-all repository for unsorted documents.

A Synthesis for Legal Professionals

Most attorneys wouldn’t want to toss their taxonomies, no matter how cluttered they’ve become. We want to enter the folder for a specific case, and know that all relevant information and subfolders for that case are there, in a stable space. We don’t want to entrust all of that to an Amazon-style “chaos.”

Researchers in computer science recognize this is the desire of most computer users. The MIT team concludes “Of course the ideal system would bring to bear the best features of each paradigm.” We want everything correctly-stored — we just want rich metadata capabilities along with it, rather than an inert stack of files.

Some software, like Evernote, already supports a limited use of tags. There are even ways to add metadata to Microsoft Office files. As our minds become more accustomed to thinking of file organization and retrieval through this paradigm, we can only expect these features to become more pervasive and easy to use.

But what about the difficulty in getting started? With no universal standard for tagging, how do we begin?

This is where forward-thinking developers in the legal tech world might actually be leading the way for once.


One possibility is the automatic generation of tags through our case management software. Because of how attorneys file documents, there’s an opportunity to develop automatic tagging systems that could be implemented in other technologies. For instance, if you save a file into a client’s case file, the system could automatically add certain words as tags, such as the name of the client, the type of case, the date, and the kind of document.
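As a rough sketch of how such automatic tagging might work, the Python below derives tags from information a case management system would plausibly already hold. Everything here is an assumption made for illustration: the `CaseFile` record, the `auto_tags` function, and the sample client are hypothetical and do not describe any existing product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseFile:
    """Hypothetical record of what the system knows about a client's case."""
    client: str
    case_type: str


def auto_tags(case, doc_kind, saved_on):
    """Build the tags a system might attach when a document is saved to a case."""
    return {
        case.client.lower(),      # name of the client
        case.case_type.lower(),   # type of case
        doc_kind.lower(),         # kind of document
        str(saved_on.year),       # year the document was saved
    }


# Invented sample data: a motion saved into a car accident case.
case = CaseFile(client="Jane Doe", case_type="Car Accident")
tags = auto_tags(case, "Motion to Dismiss", date(2015, 1, 23))
print(sorted(tags))
```

The attorney does nothing beyond saving the file where it already belongs; the folder choice itself supplies the tags.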

Currently, if you’re trying to find that particularly good “motion to dismiss” you wrote several years ago, you have to rifle through your case files hoping to recognize the name of that particular client — or else type “motion to dismiss” into the search bar and get every one you’ve ever written. With automatic smart tags, even without having carefully guessed your future requirements at the time of generating the document, when you needed it you could type “car accident, 2015, motion to dismiss” and have a significantly smaller set of documents to look through. Heck, if you manually tagged your favorite work with some sort of “wow” tag, it could be found in moments, even if you don’t remember anything about it.
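A search like that amounts to keeping only the documents that carry every requested tag. The following Python is a minimal sketch under that assumption; the file names, tag sets, and the `parse_query` and `search` helpers are hypothetical, invented to illustrate the idea.

```python
def parse_query(query):
    """Split a comma-separated query into a set of lowercase tags."""
    return {term.strip().lower() for term in query.split(",") if term.strip()}


def search(documents, query):
    """Return the names of documents whose tags include every tag in the query."""
    wanted = parse_query(query)
    return sorted(name for name, tags in documents.items() if wanted <= tags)


# Invented sample data: document name mapped to its set of tags.
documents = {
    "doe_motion.docx": {"jane doe", "car accident", "2015", "motion to dismiss", "wow"},
    "roe_motion.docx": {"richard roe", "contract dispute", "2013", "motion to dismiss"},
    "doe_letter.docx": {"jane doe", "car accident", "2015", "letter"},
}

matches = search(documents, "car accident, 2015, motion to dismiss")
favorites = search(documents, "wow")
print(matches)
print(favorites)
```

Each added tag narrows the result set, so a query with three tags returns far fewer documents than "motion to dismiss" alone.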

Implementing tagging for document retrieval purposes on a global scale is clearly impossible. But it can be done on a discrete level, without risking data falling out of our control. By tagging in our personal and business systems, our search tools can find the right document, with the right information, without needing to actually read the data like search engines currently do with web sites.

Our minds might be incredible things, but they have their limits. We embraced the technology of the filing cabinet to help us expand those limits — but it might be time we update the dusty tech we inherited from offices past and begin to radically reconsider how we organize and find what we need.