Luddite's Guide To Data Structure

One of the most important things when starting many Digital Humanities projects is maintaining consistent, well structured data.

One common difference between humanities and STEM is that humanities isn't limited to repeatable phenomena. The scientific method depends on repeated observations and repeatable experiments. While humanities' unlimited scope includes unique, or highly complex and historically contingent, situations, it can still be usefully informed by 'data'. This doesn't necessarily mean reducing humanities or trying to justify humanities by making it more scientific.

Where to Begin?

Think about what types of information about your 'objects' of study need to be recorded and presented, ideally before you begin. Don't let worries about data structure stop you from starting though. Often the structure becomes clear as soon as you start gathering the data, so it's good to make a spreadsheet, try it out on a few examples and adjust. While it's best to avoid late changes to structure so you don't have to go back to the library or the field, you can always add a column if you missed something important. If it is important, or if on the other hand you are trying to gather so much data it's not practical, you'll probably realise early on, so just get started.

How do you make information well structured? Often it's not as complicated as it seems. The simple answer is, "Just put it in a table under column headings."

This is not well structured data:

The Mona Lisa by Leonardo Da Vinci, between 1503 and 1506, maybe 1517. Most famous painting.
Last Supper, 1495 - unknown, Da Vinci. Often referenced in popular culture, this work was...
Michelangelo, c. 1511–1512, Sistine Chapel. Commissioned by...

The artists, painting titles and dates are in different orders, the dates are stored in different ways, and the name of a single individual is sometimes written differently. The descriptions are just notes and you'll want to edit them later (that's ok, but save yourself some trouble by making them as finished as possible).

This is well structured data:

Painting        | Artist            | Start Date | End Date | Date Exactness | Description
The Mona Lisa   | Leonardo Da Vinci | 1503       | 1506     | c.             | The most famous painting in the world, etc.
The Last Supper | Leonardo Da Vinci | 1495       |          | c.             | Often referenced in popular culture, this work was...
Sistine Chapel  | Michelangelo      | 1511       | 1512     | c.             | This ceiling decoration was commissioned by, etc.

That's not hard to understand. That's the main point, but there are a few more things worth bearing in mind:

Numbers, Dates and Text... and Notes

Software usually handles different kinds of data differently. The main distinctions are numbers, text and dates.

Store numbers as numbers without adding any text to them. Eg: if there is a column for 'Quantity Of Grindstones', don't put 'About 32'. Put '32'. This means we can use those numbers to arrive at (estimated) totals and averages. In humanities we are often dealing with 'data' that isn't measured as strictly or consistently as in science. Text can't be added and subtracted, so leaving the value as a number allows calculations to be made, and you can put any caveats and explanations somewhere else (eg: a column called 'notes' that says, 'Values are conservative estimates only, based on Emerson's diaries*...').
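
As a rough sketch of why this matters, the Python below totals a 'Quantity Of Grindstones' column. Only the column name comes from the example above; the site names and quantities are invented for illustration.

```python
# Hypothetical rows: the site names and quantities are invented for illustration.
rows = [
    {"Site": "Site A", "Quantity Of Grindstones": "32", "Notes": "Conservative estimate"},
    {"Site": "Site B", "Quantity Of Grindstones": "12", "Notes": ""},
    {"Site": "Site C", "Quantity Of Grindstones": "7", "Notes": "Counted on site"},
]

# Because every value is a bare number, converting and adding them is trivial.
quantities = [int(row["Quantity Of Grindstones"]) for row in rows]
total = sum(quantities)
average = total / len(quantities)

print(total)    # 51
print(average)  # 17.0

# A value such as 'About 32' cannot be converted, so the calculation would stop here.
try:
    int("About 32")
    text_value_converted = True
except ValueError:
    text_value_converted = False

print(text_value_converted)  # False
```

The caveat lives in the 'Notes' column, so it is kept without getting in the way of the arithmetic.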

Text allows for anything at all, but sometimes you want to use it for consistent named categories.

Dates and times are tricky to handle, so keep to a consistent format and don't add extra text to them. Eg: stick to dd/mm/yyyy HH:MM:SS or some other common format.
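
To show what a program does with dates, here is a minimal sketch using Python's standard datetime module. It assumes dd/mm/yyyy has been chosen as the one format; the helper name parse_date is made up for this example.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # dd/mm/yyyy, assumed here to be the agreed format


def parse_date(text):
    """Return a datetime for text in dd/mm/yyyy form, or None if it does not fit."""
    try:
        return datetime.strptime(text, DATE_FORMAT)
    except ValueError:
        return None


consistent = parse_date("29/04/2020")
different_format = parse_date("29 April 2020")
with_extra_text = parse_date("about 29/04/2020")

print(consistent.year, consistent.month, consistent.day)  # 2020 4 29
print(different_format)  # None
print(with_extra_text)   # None
```

Anything that doesn't match the agreed format comes back as None, and someone then has to fix it by hand.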

Be consistent

Always write the same thing or sort of thing in the same way. Eg: decide if you want to just write 'da Vinci' or 'Leonardo da Vinci' and always write it that way. If you record a date in the format 29/04/2020 don't change to 29 April 2020.
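
A small illustration of what inconsistency costs: a program that counts works per artist treats every spelling as a different person. The lists below are invented examples, not real data.

```python
from collections import Counter

# An artist column as it might be typed without a naming rule (illustrative values).
inconsistent = ["Leonardo da Vinci", "da Vinci", "Leonardo da Vinci", "Da Vinci"]
# The same column after settling on one way of writing the name.
consistent = ["Leonardo da Vinci"] * 4

inconsistent_counts = Counter(inconsistent)
consistent_counts = Counter(consistent)

print(len(inconsistent_counts))  # 3 apparent 'artists'
print(len(consistent_counts))    # 1 artist
```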

What To Gather?

You may want to break up the paintings table above differently, specifying whether the first or second date is uncertain, or using only the finishing date if that is all that is relevant, and adding whatever other columns are pertinent. What information you put in depends on:

More Is Better

If you can gather more details, do. It's easier to take out subsets of information later than to revisit every data item.

Don't use MS Word

Avoid MS Word for recording data. Use it for writing letters and essays. Although you can make tables in MS Word, and they are better than just notes, they will ultimately need to be copied to some other format that a computer program can more easily handle. The most commonly used tool, and one that is much easier for a computer to handle, is Excel. If you make columns in Excel you are off to a good start and will save everyone, including yourself, a lot of time and headaches later. This is because Excel files can be saved as .CSV files, which are easy for computers and programmers to work with. (Note that you can still make a mess of an Excel or .CSV file. Just keep all the data broken up into columns, with only one type of information in each column.)
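
To give a feel for how simple a .CSV file is, here is a minimal sketch using Python's built-in csv module. It writes two rows from the paintings table to CSV text and reads them back. The file name paintings.csv in the comment is hypothetical.

```python
import csv
import io

columns = ["Painting", "Artist", "Start Date", "End Date"]
rows = [
    {"Painting": "The Mona Lisa", "Artist": "Leonardo Da Vinci", "Start Date": "1503", "End Date": "1506"},
    {"Painting": "The Last Supper", "Artist": "Leonardo Da Vinci", "Start Date": "1495", "End Date": ""},
]

# Write to an in-memory buffer so the sketch runs without touching the disk.
# A real file opened with open("paintings.csv", "w", newline="") works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back gives one dictionary per row, keyed by column heading.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]["Artist"])  # Leonardo Da Vinci
```

The CSV text is just the column headings followed by one line per row, with commas between the values.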

Structure As You Go

It's easiest to gather your information in the right structure as you go, rather than transcribe it later.

Just Ask

If possible, ask someone what fields (or column headings) are required, or if your data structure is good. If you intend your data to go into a particular system, check what requirements it has. Eg: if you want to put your data into Google Maps, even if you're not sure about the technical standards of KML and other acronyms, you can see that you should at least have a 'longitude', 'latitude', 'name' and 'description' for every point you want to plot. If you have at least that in a spreadsheet, it can be converted to the right format.
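
As a sketch of what such a conversion might look like, the Python below turns rows with those four fields into a bare-bones KML document using the standard library. The function name points_to_kml is made up, the two points use rough city coordinates purely for illustration, and a real system may want more than this minimal structure.

```python
import xml.etree.ElementTree as ET

# Illustrative points: the coordinates are rough city locations, for demonstration only.
points = [
    {"name": "London", "description": "Example point", "latitude": 51.51, "longitude": -0.13},
    {"name": "Melbourne", "description": "Example point", "latitude": -37.81, "longitude": 144.96},
]


def points_to_kml(rows):
    """Build a minimal KML document with one Placemark per row."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    document = ET.SubElement(kml, "Document")
    for row in rows:
        placemark = ET.SubElement(document, "Placemark")
        ET.SubElement(placemark, "name").text = row["name"]
        ET.SubElement(placemark, "description").text = row["description"]
        point = ET.SubElement(placemark, "Point")
        # KML writes longitude first, then latitude.
        ET.SubElement(point, "coordinates").text = f"{row['longitude']},{row['latitude']}"
    return ET.tostring(kml, encoding="unicode")


kml_text = points_to_kml(points)
print(kml_text)
```

The program doesn't need to know anything about the places. It only relies on every row having the same four fields.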

One Type Of Information, One Column

If types of information can be distinguished, split them up into separate columns. Eg:

Artist                 | Painting
Grace Cossington Smith | The Bridge in Curve (1930)
Katsushika Hokusai     | The Great Wave off Kanagawa (1833)

becomes

Artist                 | Painting Title              | Painting Date
Grace Cossington Smith | The Bridge in Curve         | 1930
Katsushika Hokusai     | The Great Wave off Kanagawa | 1833
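
If you already have the combined version, the split can often be done by a short program, provided the combined text follows a consistent pattern. This sketch assumes every painting is written as a title followed by a four-digit year in brackets. Anything that doesn't fit is left alone rather than guessed at.

```python
import re

combined = [
    ("Grace Cossington Smith", "The Bridge in Curve (1930)"),
    ("Katsushika Hokusai", "The Great Wave off Kanagawa (1833)"),
]

# A title, then a four-digit year in brackets at the very end of the text.
pattern = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)$")

split_rows = []
for artist, painting in combined:
    match = pattern.match(painting)
    if match:
        split_rows.append((artist, match.group("title"), int(match.group("year"))))
    else:
        # Leave the date empty rather than guess when the pattern does not fit.
        split_rows.append((artist, painting, None))

for row in split_rows:
    print(row)
```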

This structure or that?

There can be a bit of an art to designing structure. Depending on the nature of your research and the data you can find, you might organise it one way or another. Eg: let's say it's about artists and places they are associated with. You could do it like this:

Artist       | Birth Place        | Place of Death
Sydney Nolan | Carlton, Melbourne | London

or like this

Artist       | Place              | Place Relation
Sydney Nolan | Carlton, Melbourne | birthplace
Sydney Nolan | London             | place of death
Sydney Nolan | Heidi              | lived
Sydney Nolan | Birdsville         | photographed

The first is more suitable if you are only specifically interested in places of birth and death, but would result in too many columns, many with empty data, if you wanted a column for every type of possible place. The second allows for any kind of place associated with the artist, but if possible, the 'Place Relation' should still use consistent categories.
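
Because both layouts are consistent, switching from the first to the second is mechanical. Here is a minimal sketch. The mapping from column names to 'Place Relation' categories is an assumption made for the example.

```python
wide_rows = [
    {"Artist": "Sydney Nolan", "Birth Place": "Carlton, Melbourne", "Place of Death": "London"},
]

# Map each place column in the first layout to a category in the second (assumed names).
relation_for_column = {"Birth Place": "birthplace", "Place of Death": "place of death"}

long_rows = []
for row in wide_rows:
    for column, relation in relation_for_column.items():
        place = row.get(column, "")
        if place:  # skip empty cells instead of creating empty rows
            long_rows.append({"Artist": row["Artist"], "Place": place, "Place Relation": relation})

for row in long_rows:
    print(row)
```

Going the other way is just as easy, as long as the 'Place Relation' values are written consistently.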

Complex Structures

Information structure can sometimes get a bit complex. Let's say you want to have some extra information about the artists, such as when each was born and died, whether they were sculptors and/or painters, what cities they worked in, who their patrons were, etc. You don't want to add all that information to every row in your table of paintings. You need a separate table that stores the artist information just once for each artist. You can then relate this back to the painting by the artist's name. This is the structure of a 'relational database'. You can still gather this data in Excel for convenience, but make sure you are consistent in using the artists' names, so that they will match up across tables. Keeping these tables makes it possible to convert the information into a proper database, which can then be used to mix and match, filter and display the data in all manner of ways, including for the web.
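
To make the idea concrete, here is a small sketch using SQLite, a database that comes with Python. The table and column names (artists, paintings, born, died and so on) are choices made for this example, not a required layout. The two tables are matched up by the artist's name, exactly as described above.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # a temporary database held in memory
connection.execute("CREATE TABLE artists (name TEXT PRIMARY KEY, born INTEGER, died INTEGER)")
connection.execute("CREATE TABLE paintings (title TEXT, artist TEXT, start_date INTEGER)")

# Each artist's details are stored once only.
connection.executemany("INSERT INTO artists VALUES (?, ?, ?)", [
    ("Leonardo Da Vinci", 1452, 1519),
    ("Michelangelo", 1475, 1564),
])
connection.executemany("INSERT INTO paintings VALUES (?, ?, ?)", [
    ("The Mona Lisa", "Leonardo Da Vinci", 1503),
    ("The Last Supper", "Leonardo Da Vinci", 1495),
    ("Sistine Chapel", "Michelangelo", 1511),
])

# Relate each painting back to its artist by name.
query = """
    SELECT paintings.title, artists.name, artists.born
    FROM paintings
    JOIN artists ON paintings.artist = artists.name
    ORDER BY paintings.start_date
"""
joined = connection.execute(query).fetchall()
for row in joined:
    print(row)
connection.close()
```

If a name were spelled differently in the two tables, that painting would silently drop out of the result. This is why consistent names matter so much here.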

A Paradox of Structure and Flexibility

Why structure information this way? It seems rigid and inflexible, but well structured data is what enables computers to be flexible. A computer doesn't care if there are a few records or a million: if they are structured in the same way it will process them quickly. It can filter and mix and match the information, change formats, run calculations, and pass the data to visualisations. Without a consistent format the computer can only display it the way you put it in. It can't do anything with the data. You lose the ability to manipulate it, and so badly structured data, while flexible in your terms, is inflexible for a computer.

In the badly formatted art information above, the computer has no way of knowing which text it should treat as an artist, which as a painting name and so on. If it is in columns, the computer can treat everything in the first column as an artwork, everything in the second as an artist, and so on. 'Structured data' is part of working with computers as a medium - you don't normally work clay with a paint brush, and you don't normally spin paint on a potter's wheel. To work with computers, use structured data.
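
Here is a minimal sketch of that idea in Python, using a cut-down version of the paintings table written out as CSV text. The program knows nothing about art. It relies only on the position of each column.

```python
import csv
import io

table = """Painting,Artist,Start Date
The Mona Lisa,Leonardo Da Vinci,1503
The Last Supper,Leonardo Da Vinci,1495
Sistine Chapel,Michelangelo,1511
"""

reader = csv.reader(io.StringIO(table))
header = next(reader)   # the first line holds the column headings
records = list(reader)  # every following line is one record

# Everything in the first column is an artwork, everything in the second an artist.
artworks = [row[0] for row in records]
artists = sorted({row[1] for row in records})
by_leonardo = [row[0] for row in records if row[1] == "Leonardo Da Vinci"]

print(artworks)
print(artists)
print(by_leonardo)
```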

So while lots of different systems require different formats, the most important thing is to be consistent and structured. Even if you don't know what specialised formats it might have to be in later, if it is well structured it's much easier to write a small program to convert it all into the right format for any system.
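
As an example of such a small program, this sketch converts CSV text into JSON, a format many web systems expect. JSON is only one possible target, chosen here for illustration. The same few lines could be adapted to whatever format a particular system needs.

```python
import csv
import io
import json

csv_text = """Artist,Painting Title,Painting Date
Grace Cossington Smith,The Bridge in Curve,1930
Katsushika Hokusai,The Great Wave off Kanagawa,1833
"""

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["Painting Date"] = int(row["Painting Date"])  # keep the year as a number
    records.append(row)

json_text = json.dumps(records, indent=2)
print(json_text)
```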

If it's too late and you only have badly structured information, even if it takes hours or days to convert your notes into well structured data, it's a small effort for the benefits of being able to query it to identify relationships, extract subsets for other purposes, generate lists for publications, run it through statistics applications to generate graphs, plot it on a map, make an online gallery, turn it into social network diagrams, display it on the web and whatever other relevant thing a computer can do.