Chat JFK

The most recent 2025 release of files in the National Archives JFK Assassination Records collection had me wondering what the entire released collection looked like, so I found the JFK Assassination Bulk Download Files page and took a peek at the 35 or so ZIP files going back to July of 2017.

I had browsed through some of the individual PDF documents back in 2024, and had recently converted some PDF documents to Markdown for use in vector search indices for LLM agents. Now my mind raced: what could we learn if we had an LLM searching through this content and taking notes?

Naturally, I ended up combining these JFK files with the PDF conversion stack I had tried out, for great research justice. Read on to learn more.

Goal

My goal derived mainly from educational value and discovering interesting connections. To that end, I would retrieve the public domain data from the official source and transform (with the highest possible fidelity) the PDF data into Markdown formatted text data.

I could then use the textual data with LLM agentic workflows or maybe fine-tune a new model and call it Chat JFK.

Process

My modest lab for this project consisted of a single Nvidia RTX 3090 Founders Edition GPU with 24GB VRAM combined with some open source software.

A range of open source projects exist that use vision models to assist with converting binary document formats to text format, which in turn becomes easier for an LLM agent to consume for other use cases.

I set out to try these solutions, and find the best fit for the PDF file data I needed to process.

PDF data

I tried a handful of projects, from Docling to Markitdown, but found that for this dataset, the MinerU project provided the most accuracy and capabilities for my lab environment.

I used a single server to fetch the ZIP files, extract them, run MinerU over them, and save the outputs. Hugging Face recommends fewer than 10000 entries per folder, so I split some of the larger folders and ended up with these folders.

jfk-pdf1        jfk2023c        jfk201804c      jfk20171103
jfk-pdf2        jfk2023d        jfk201804d      jfk20171109-1
jfk-pdf3        jfk2023e        jfk201804e      jfk20171109-2
jfk2021         jfk2023f        jfk201804f      jfk20171215a
jfk2022-1       jfk2025a        jfk201804g      jfk20171215b
jfk2022-2       jfk2025b        jfk201804h      jfk20171215c
jfk2023a        jfk201804a      jfk201804i      jfk2017111710-1
jfk2023b        jfk201804b      jfk201804j      jfk2017111710-2
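A minimal sketch of the folder-splitting step might look like the following. It is not the exact script used for this dataset: the function name and the numbered naming scheme (`<name>-1`, `<name>-2`, ...) are assumptions for illustration, and the real folders above use a mix of suffix styles.

```python
import shutil
from pathlib import Path

MAX_ENTRIES = 9_999  # stay under the 10000-entries-per-folder recommendation


def split_folder(source: Path, max_entries: int = MAX_ENTRIES) -> list[Path]:
    """Move the entries of `source` into numbered sibling folders.

    Each new folder holds at most `max_entries` entries. The naming scheme
    is a hypothetical choice for this sketch.
    """
    entries = sorted(source.iterdir())
    if len(entries) <= max_entries:
        return [source]  # small enough already, nothing to split

    created = []
    for start in range(0, len(entries), max_entries):
        part = source.with_name(f"{source.name}-{start // max_entries + 1}")
        part.mkdir()
        for entry in entries[start:start + max_entries]:
            shutil.move(str(entry), str(part / entry.name))
        created.append(part)

    source.rmdir()  # the original folder is empty at this point
    return created
```

Sorting the entries first keeps the split deterministic, so a rerun over the same input produces the same grouping.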

Within each folder exist sub-folders named for the converted PDF file. For example, the first sub-folder in jfk-pdf1 is 104-10162-10011, named for the original 104-10162-10011.pdf file.

If we peek within that folder with tree:

tree -L 1 jfk-pdf1/104-10162-10011/hybrid_auto

We find a hybrid_auto folder that holds all the artifacts generated by a MinerU conversion.

jfk-pdf1/104-10162-10011/hybrid_auto
├── 104-10162-10011_content_list.json      # Content list metadata
├── 104-10162-10011_content_list_v2.json   # Version 2 content list additions
├── 104-10162-10011_layout.pdf             # Layout analysis highlighted PDF
├── 104-10162-10011.md                     # Markdown of text content
├── 104-10162-10011_middle.json            # Intermediate processing results
├── 104-10162-10011_model.json             # Model inference results
├── 104-10162-10011_origin.pdf             # Original PDF
└── images

The layout analysis PDF shows the highlighting and identification performed by the vision model.

This fragment directly from 104-10162-10011_layout.pdf shows green and red highlighting of layout along with numerical annotation of text elements.

Fragment of analysis PDF file

The images folder holds any image areas of the PDF saved as JPG files.

Unusual files

Data wrangling can get wonky at times, and in this case a perfect, high-fidelity capture of the source was unfortunately not possible. I made sure to note in the dataset card that the project does not assure 100% coverage of the source material.

Here are some anecdotal examples of a breakdown in the process, which resulted in at least these 8 missing files:

  1. 104-10072-10314.pdf
  2. 104-10077-10401.pdf
  3. 104-10120-10100.pdf
  4. 104-10142-10133.pdf
  5. 104-10172-10167.pdf
  6. 176-10036-10101.pdf
  7. 104-10177-10088.pdf
  8. 104-10171-10131.pdf

Oddly enough, MinerU refuses to process these files at all, with an error like this example:

Error: No supported documents found under 104-10072-10314.pdf

When I viewed some of them manually, they appeared fine, but further investigation with PDF tools might shed some light on the issue.
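As a starting point for that investigation, here is a small sketch that checks a few structural markers a well-formed PDF normally has. It uses only the Python standard library, the function name is hypothetical, and it is only a first sanity check: I have not confirmed that any of these markers explain the MinerU error. One unconfirmed guess is that a file with stray bytes before the `%PDF-` header opens fine in a viewer but fails stricter file type detection.

```python
from pathlib import Path


def quick_pdf_check(path: Path) -> dict[str, bool]:
    """Report a few structural markers that a well-formed PDF normally has."""
    data = path.read_bytes()
    head = data[:1024]
    tail = data[-2048:]
    return {
        "non_empty": len(data) > 0,
        # Strict type sniffers often expect the header at byte offset 0.
        "starts_with_header": data.startswith(b"%PDF-"),
        # Lenient viewers usually accept the header anywhere near the start.
        "header_in_first_kb": b"%PDF-" in head,
        # Pointer to the cross-reference table, expected near the end.
        "has_startxref": b"startxref" in tail,
        # End-of-file marker, expected near the end.
        "has_eof": b"%%EOF" in tail,
    }


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, quick_pdf_check(Path(name)))
```

Any file that reports `False` for one of these checks would be a good candidate for a closer look with a dedicated PDF repair tool.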

Then there was audio

The complete release includes 12.8 GB of zipped WAV audio across 6 files, named jfk-wav1.zip through jfk-wav6.zip.

At first, I had interest in using the state of the art in open speech-to-text models to take a shot at transcribing the audio files to augment the dataset.

The tests ran with English as the specified language. I tried both Moonshine and Whisper Large-V3 on a file and the results sort of spoke for themselves.

They spoke in copious amounts of totally incorrect and at times heavily obscenity-laced transcription. Some repetition and overall hallucination occurred as well.

Then I tried actually listening to a couple files. While I expected pretty bad quality, the actual quality borders on worse than bad. At times a 60 Hz hum rides along with what turns out to be mixtures of English, Spanish, and Russian speech degraded to such poor quality that I could hardly make out any words myself.

These audio files would need significant wizardry to de-noise and clarify the sound to the point that one could listen and decipher the audio content long before asking the model to do so.
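For the hum alone, a narrow notch filter is one common first step. The sketch below is a pure Python biquad notch using the widely published Audio EQ Cookbook coefficient formulas. I did not run it on these recordings; the sample rate and Q value are placeholder assumptions, and removing the hum would not fix the other quality problems.

```python
import math


def notch_coefficients(freq_hz: float, sample_rate: int, q: float = 30.0):
    """Return biquad notch coefficients, normalized so that a0 == 1."""
    w0 = 2.0 * math.pi * freq_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def notch_filter(samples, freq_hz=60.0, sample_rate=8000, q=30.0):
    """Apply the notch to a sequence of float samples (direct form I)."""
    (b0, b1, b2), (a1, a2) = notch_coefficients(freq_hz, sample_rate, q)
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

A higher Q makes the notch narrower, which removes less of the nearby speech content but takes longer to settle at the start of the recording.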

An experiment for another day? Yes, but not today and not for this project.

2 random MP3s please

Some of the ZIP files contain MP3 audio files mixed in with the PDF files.

  • jfk201804i.zip
    • 233-jfk-000443-tape1-s1.mp3
    • 233-jfk-000443-tape2-s2.mp3

The files contain usable audio recorded in 1976, and seem good to transcribe with a speech-to-text (STT) model. I gave that a go, and discovered that these models produce close but still imperfect results.

Check out this brief example output from Moonshine streaming-small:

[00:02:20] And?
[00:02:21] And which?
[00:02:23] How long did you live?
[00:02:25] For the company.
[00:02:26] About 15 years.
[00:02:28] All right, sir.
[00:02:30] And in 1963, the first time I saw him,
[00:02:33] Get with your booty station.
[00:02:36] Well, I came here from Washington, D.C.
[00:02:39] Hmm.

Hmm, indeed. “Get with your booty station.” got incorrectly transcribed from “what was your duty station?”

Even these higher quality audio recordings presented some challenges for the open STT models.

Whisper Large-V3 produced far more accurate results, as this example output shows:

All right.
And that is of the United States government.
Oh, definitely.
All right.
And at which, how long did you work for the company?
About 15 years.
All right, sir.
And in 1963, what was your duty station?
Well, I came here from Washington, D.C.
It must have been the 6th of June, if I'm not mistaken.

NOTE: The transcribed audio from the MP3 files does not presently appear in the complete dataset. You can check the included SHASUMS.txt for every included file and its SHA-256 hash.
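If you want to verify a download against SHASUMS.txt without extra tools, a sketch like this would do it. It assumes each line follows the common `sha256sum` layout of a hex digest, whitespace, then a path relative to the dataset root; check the actual file before relying on that assumption. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shasums(shasums_file: Path, root: Path) -> list[str]:
    """Return the listed paths that are missing or whose hash does not match.

    Assumes lines shaped like `<hex digest>  <relative path>`.
    """
    problems = []
    for line in shasums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            problems.append(name)
    return problems
```

An empty list from `verify_shasums` means every listed file exists and matches its recorded hash.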

Results

After one false start where I failed to keep all the MinerU output, I gradually processed the PDF data over a week and about a dozen runs. While you might feel tempted to just grab the Markdown files and run, they reference images extracted from the PDF, and the other JSON and PDF data has value for other LLM or ML use cases.

Total post-processing dataset:

  • Files: 1689962
  • Size: 227 GB

I strongly encourage you to preserve all output, and keep in mind that MinerU even keeps (with slight renaming) the original PDF, so you don’t really need to worry about the inputs hanging around.

A wild new Hugging Face dataset appears

After my experiment to convert these files, I put a copy of just the Markdown files on Hugging Face as the Chat JFK dataset.

Final Markdown dataset:

  • Files: 71709
  • Size: 1.56 GB
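Pulling just the Markdown out of the full MinerU output can be sketched as below. This is an illustration rather than the script I ran: it assumes the `<folder>/<document>/hybrid_auto/<document>.md` layout shown earlier, and the flat `<folder>/<document>.md` destination layout is an assumption based on the paths the agent cites later in this post.

```python
import shutil
from pathlib import Path


def collect_markdown(full_root: Path, md_root: Path) -> tuple[int, int]:
    """Copy every MinerU Markdown file into a Markdown-only tree.

    Source:      <full_root>/<folder>/<document>/hybrid_auto/<document>.md
    Destination: <md_root>/<folder>/<document>.md  (assumed layout)
    Returns (number of files copied, total bytes copied).
    """
    count = 0
    total_bytes = 0
    for md_file in sorted(full_root.glob("*/*/hybrid_auto/*.md")):
        folder = md_file.parents[2].name  # e.g. jfk-pdf1
        dest = md_root / folder / md_file.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(md_file, dest)
        count += 1
        total_bytes += dest.stat().st_size
    return count, total_bytes
```

The returned count and byte total give a quick way to compare a local copy against the file count and size listed above.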

Check out the dataset card for more details.

The dataset represents raw data without any clean up whatsoever.

While the accuracy of the transcription of text from the PDF files to Markdown comes close to an accurate reproduction in some cases, some examples of hallucination most certainly appear in the dataset.

Always refer to the original document to confirm content before using it for anything important.

Use the dataset in creative ways

Build a Query Markup Documents index and search it

If you have a machine with at least 16GB, you can try these simple experiments with the fantastic on-device search from Query Markup Documents (QMD).

After you get the dataset, you can point qmd at it as the data source for a collection that you can then search by name in fun ways with your favorite LLM agent.

This example uses the Pi coding agent and the Gemma-4 31B IT model served by llama.cpp on my single 3090.

PROMPT:

Research the jfk collection for locations in Florida and generate a compact
summary of the significant locations and events or persons associated with
them. Write the summary as Markdown in the file JFK-FLORIDA.md and keep it
under 250 lines total.

RESPONSE:

The research into the JFK collection for locations in Florida has been
completed. Significant findings include:

- Tampa: Centered around Santo Trafficante Jr., a powerful LCN figure who had
  connections to the Marcello brothers and played a role in CIA plots
  against Fidel Castro.
- Miami: A hub for Havana-based activity and the site of a major 1967 incident
  at Tamiami Airport, where MIRR leader Orlando Bosch's associates were
  arrested with bombs and napalm intended for ships trading with Cuba.
- Jacksonville: The base for the United Florida Ku Klux Klan (UFKKK), which was
  targeted by FBI counterintelligence efforts to remove leadership like
  Boyd Hamby.
- Ft. Lauderdale: Associated with Jimmy Fazio and his Fireside Steak House,
  as well as LCN-linked activity at nightclubs like Bachelors III.

The summary has been written to JFK-FLORIDA.md.

The complete JFK-FLORIDA.md content:


Significant Locations and Events in Florida (JFK Collection)

Tampa

  • Santo Trafficante Jr.: Born in Tampa; head of the La Cosa Nostra (LCN) family in the Tampa area with control over various criminal activities across Florida.
  • Marcello Connection: Carlos and Joseph Marcello (from New Orleans) visited Santo Trafficante at his Tampa home in early June 1966.
  • CIA Operations: Santo Trafficante is noted as having played an “accompanying role” in the CIA-inspired plots to assassinate Fidel Castro.

Miami

  • Tamiami Airport:
    • On January 15, 1967, Bureau of Customs agents arrested three men preparing to depart with bombs, napalm, and small arms.
    • The event is associated with Orlando Bosch, leader of the MIRR (Movimiento Insurreccional Revolucionario), and a conspiracy to bomb ships trading with Cuba.
  • Havana Connection: Miami served as a primary hub for individuals traveling between Havana, Cuba, and the U.S., including counterrevolutionary contacts and intelligence assets.
  • Informants: Notable informants based in Miami include William Eisenberg (MM T-2) and Joseph Merola (MM T-3).

Jacksonville

  • United Florida Ku Klux Klan (UFKKK): Jacksonville was a primary base for the UFKKK.
  • FBI Counterintelligence: The FBI utilized informants within the Klan to destabilize the organization, including efforts to effect the removal of Boyd Hamby as Grand Dragon of the United Klans of America, Realm of Florida.
  • Other Regional Activity: Related Klan activity and meetings were recorded in Orlando, Plant City, Sebring, Sebastian, and Lakeland (e.g., Forrest Club #11).

Ft. Lauderdale & East Coast

  • Jimmy Fazio: Operated the Fireside Steak House in Ft. Lauderdale. In December 1962, he was informed by a source that “New York ‘boys’” had been seen in Cocoa Beach.
  • Bachelors III Nightclub: Located in Ft. Lauderdale; associated with Bobby Vannuchi and Joe Namath, and site of violence involving LCN figure Jimmy Capatorto.

Assessment: Aside from the copious passive voice and other writing style rule violations, the agent created pretty decent and educationally informative content that could further improve through more rigorous prompt and skill tuning and through the use of more powerful models.

Though one could presume that the accuracy of the resulting content exceeds what a zero-shot prompt without the search context would produce, one should remain aware that this perceived accuracy also depends entirely on the quality of the conversion process from the source PDF files.

Let’s try another prompt about a specific individual.

PROMPT:

Research the jfk collection for all references to Frank STURGIS. Write the
summary as Markdown in the file STURGIS.md and reference each document
filename in the summary.

RESPONSE:

I have completed the research on Frank Sturgis within the JFK collection.

 The findings were summarized in STURGIS.md. Key highlights include:
 - Background: A former US Marine (First Raider Battalion) and multi-branch
   veteran trained in unconventional warfare, piloting, and espionage.
 - Intelligence Work: Operated as a triple agent; recruited by the State
   Department/CIA to infiltrate Castro's inner circle. He served as
   security chief for the Cuban Air Force and managed intelligence networks.
 - Anti-Castro Ops: Participated in an anti-Castro leaflet raid over Havana
   (1959) and proposed several assassination plots against Fidel Castro,
   including a crossfire ambush at Campo Libertad.
 - Controversies: FBI records describe him as a "braggart" and "liar," noting
   that some of his paramilitary organizations were mere fronts for personal
   gain. He later became infamous as a Watergate burglar.
 - Strategic Impact: Recruited agents like Marie Lorenz, whose theft of maps
   from Castro's office reportedly assisted the U.S. in locating Soviet missile
   sites during the Cuban Missile Crisis.

 The summary includes references to all relevant documents found during 
 the search.

The complete STURGIS.md content:


Frank Sturgis (Frank Fiorini)

Personal Background

Frank Anthony Sturgis, born in Norfolk, Virginia, also known as Frank Fiorini, Fred Frank Fiorini, Frank Attila, Fred Attila, and Frank Campbell. He had an extensive military background, serving in the U.S. Marine Corps during World War II (First Raider Battalion), and later in the Air Force, Army, and Navy Reserve. His training included judo, knife-fighting, jungle warfare, piloting aircraft, and operating small, fast ships.

Intelligence Career and CIA Involvement

Sturgis operated as a “triple agent” during the Cold War:

  • Early Cuban Contact: Initially joined Fidel Castro’s army in 1957 to spy for Carlos Prio; he became a gunrunner for Castro, purchasing weapons from International Armament Corp (Interarmco).
  • CIA Recruitment: He was recruited by Park F. Woolam of the State Department in Santiago de Cuba and subsequently worked for the CIA.
  • Roles in Cuba: Served as security and intelligence chief for the Cuban Air Force under Major Pedro Diaz Lanz and supervised gambling casinos owned by American mobsters.
  • Espionage Network: Sturgis claimed to have constructed an espionage network in Cuba that remained functional for years. He also recruited other agents, including Marie Lorenz, who stole secret documents and maps from Castro’s headquarters—information that reportedly helped the U.S. identify Soviet missile sites during the Cuban Missile Crisis.

Anti-Castro Activities and Assassination Plots

Sturgis was deeply involved in plots to overthrow foreign governments (including Cuba, Panama, Guatemala, Dominican Republic, and Haiti) and specifically targeted Fidel Castro:

  • Assassination Proposals: He claimed that as early as February 1959, he suggested the assassination of Castro to CIA contacts. Proposed methods included a crossfire ambush at Campo Libertad and planting explosives in Castro’s office.
  • Leaflet Raid: On October 21, 1959, Sturgis served as co-pilot for Pedro Diaz Lanz during an anti-Castro leaflet dropping raid over Havana.
  • Paramilitary Groups: He founded the “International Brigade” and the “International Anti-Communist Brigade” (IACB). However, FBI investigations described him as a “braggart” and a “liar,” characterizing the IACB as a “paper organization” used by Sturgis for personal financial gain.
  • Citizenship: In 1960, he faced expatriation hearings due to his service in the Cuban armed forces; his U.S. citizenship was restored in March 1961.
  • Watergate: He later gained widespread notoriety as one of the burglars in the Watergate break-in.
  • Public Claims: In a 1975 interview with the Miami Herald, Sturgis admitted to being a CIA operative from 1959 to 1968 and discussed his involvement in numerous assassination plots and conspiracies.

Referenced Documents

  • jfk20171109-1/104-10096-10131.md (FBI background, aliases, and early activities)
  • jfk2022-2/104-10221-10314.md (Miami Herald interview regarding CIA and assassination plots)
  • jfk2025b/135-10001-10288.md (JFK Assassination System reference)
  • jfk2022-2/docid-32105623.md (Agency file regarding testimony and tapes)
  • jfk2023e/104-10221-10039.md (Personnel file starting Jan ‘69)
  • jfk201804f/104-10256-10286.md (FBI report on contact via State Department)
  • jfk20171109-1/104-10048-10438.md (Identification and birth records)
  • jfk20171109-2/104-10221-10189.md (DOJ reference to aliases)

For one last example, with qmd again doing the search, here is a prompt for “what happened on this day?” style retrieval of events.

PROMPT:

Research the JFK collection for recorded events that relate to the date
June 17 and summarize what you find with document references in Markdown
format, then save in the file JUNE17.md

RESPONSE:


JFK Collection Research: June 17

Research into the JFK collection identified the following recorded events relating to the date June 17:

June 17, 1963

  • Gambling Raid Legal Proceedings: On the morning of June 17, 1963, all principals arrested in a gambling raid appeared before Trial Magistrate William Lokar of St. Marys County, Maryland. The hearings were subsequently postponed until the following morning, June 18.
  • Grand Jury Reconvened: On the same day (June 17, 1963), Circuit Court Judge Phillip H. Dorsey reconvened the Grand Jury for the Circuit Court of St. Marys County, Maryland, to investigate all aspects of the gambling matter.

Document Reference:

  • qmd://jfk/jfk2017111710-2/docid-32331412.md (DocID #332745)

June 17, 1972

  • AMSNAP-3 Relations: A memorandum authored by Clement E. Chen reviewed “Significant Factors Covering Station Relations with AMSNAP-3” specifically in the light of events that occurred on June 17, 1972.

Document Reference:

  • qmd://jfk/jfk2022-1/104-10096-10131.md (DocID #65389e)

The output lists the PDF document IDs so you can cross-reference with the original PDF file.
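If you kept the full MinerU output, a small helper can turn one of these references into the path of the preserved original PDF. This sketch assumes the `hybrid_auto` layout and `_origin.pdf` naming shown earlier in this post; the function name and the `/data/jfk` root are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def origin_pdf_for(reference: str, full_root: Path) -> Path:
    """Map a Markdown reference to the MinerU copy of the original PDF.

    Accepts `qmd://<collection>/<folder>/<document>.md` or a plain
    `<folder>/<document>.md` path.
    """
    path = urlparse(reference).path if "://" in reference else reference
    folder, filename = PurePosixPath(path.strip("/")).parts[-2:]
    doc_id = PurePosixPath(filename).stem  # drop the .md extension
    return full_root / folder / doc_id / "hybrid_auto" / f"{doc_id}_origin.pdf"


if __name__ == "__main__":
    ref = "qmd://jfk/jfk2017111710-2/docid-32331412.md"
    print(origin_pdf_for(ref, Path("/data/jfk")))
```

Opening the resulting PDF next to the agent's summary makes it easy to confirm each claim against the source page.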

Summary

Rapid advances in local model and tooling technologies enable a surprising range of capabilities for even casual researchers or data scientists to examine the historical record with tools we could not even imagine just a decade before.

Running such experiments in a modest local lab represents a massive shift in how people can begin to use LLMs with information regardless of its source format.